Our main aim is to perform Customer Segmentation with K-Means Clustering on a transactional dataset spanning 2022 and 2023. We have also done Exploratory Data Analysis (EDA), including descriptive statistics and visualization of the dataset, and built a Brand Recommendation Machine that recommends the top brand(s) when a customer searches for an item name.
Why K-means is a popular choice for Customer Segmentation:
Unsupervised Learning: Unlike supervised learning, where models learn from labeled data (e.g., spam vs. non-spam), unsupervised learning identifies patterns in data without labeled examples. In customer segmentation, we often lack labeled data, making unsupervised methods like K-means valuable.
Customer Segmentation: K-means is ideal for creating clearly defined customer segments based on various features. These features might include recency, frequency, monetary value, or product preferences. The model assigns each customer to a cluster, allowing personalized marketing and service strategies.
How K-means Works: The “k” in K-means is the number of clusters to create. The algorithm starts by randomly placing “centroids” (initial cluster centers). It then assigns each observation (customer) to its nearest centroid and updates each centroid to the mean of the observations in its cluster. These two steps repeat until the assignments stop changing, which minimizes the within-cluster sum of squared distances.
K-means clustering allows businesses to create meaningful customer segments, tailor marketing efforts, and enhance customer experiences.
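To make the assignment/update loop concrete, here is a minimal from-scratch sketch on synthetic 2-D data (illustrative only; the clustering later in this notebook uses scikit-learn's KMeans, which adds smarter initialization and multiple restarts):
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # toy data: 200 points, 2 features
k = 3
centroids = X[rng.choice(len(X), k, replace=False)]  # random initial cluster centers
for _ in range(100):
    # Assignment step: attach each point to its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):        # converged: assignments are stable
        break
    centroids = new_centroids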
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt
import plotly.express as px
from scipy.stats import linregress
from sklearn.ensemble import IsolationForest
from matplotlib.colors import LinearSegmentedColormap
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
import matplotlib.gridspec as gridspec
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from collections import Counter
import plotly.graph_objects as go
from tabulate import tabulate
%matplotlib inline
df = pd.read_csv("online_shopping_dataset.csv")
df.head()
| | UserID | Age | Gender | Location | Item Name | Brand | Price | Rating | Number of Reviews | Purchase History | Search History | Wishlist | Cart Items | Click History | Transaction No. | Transaction Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User14 | 40 | Female | Russia | Watch | Henry-Johnson | 205.90 | 2.8 | 624 | True | False | False | False | False | 1 | 2023-07-08 00:00:00 |
| 1 | User8 | 50 | Male | Russia | T-Shirt | Lawson-Tucker | 132.02 | 3.1 | 794 | False | False | False | True | True | 2 | 2023-03-14 00:00:00 |
| 2 | User2 | 59 | Male | India | Headphones | Gomez LLC | 384.34 | 1.4 | 619 | False | False | False | False | False | 3 | 2022-10-17 00:00:00 |
| 3 | User6 | 48 | Female | Argentina | T-Shirt | Morris-Miller | 341.77 | 3.0 | 336 | True | True | True | False | True | 4 | 2022-04-02 00:00:00 |
| 4 | User5 | 49 | Female | Singapore | Bag | Waters-Hall | 309.58 | 3.6 | 973 | True | True | True | False | False | 5 | 2022-01-29 00:00:00 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   UserID             1000 non-null   object
 1   Age                1000 non-null   int64
 2   Gender             1000 non-null   object
 3   Location           1000 non-null   object
 4   Item Name          1000 non-null   object
 5   Brand              1000 non-null   object
 6   Price              1000 non-null   float64
 7   Rating             1000 non-null   float64
 8   Number of Reviews  1000 non-null   int64
 9   Purchase History   1000 non-null   bool
 10  Search History     1000 non-null   bool
 11  Wishlist           1000 non-null   bool
 12  Cart Items         1000 non-null   bool
 13  Click History      1000 non-null   bool
 14  Transaction No.    1000 non-null   int64
 15  Transaction Date   1000 non-null   object
dtypes: bool(5), float64(2), int64(3), object(6)
memory usage: 90.9+ KB
# Summary statistics for numerical variables
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 1000.0 | 41.58200 | 13.829480 | 18.00 | 30.000 | 42.00 | 54.0000 | 65.00 |
| Price | 1000.0 | 254.49653 | 140.606560 | 10.16 | 138.135 | 248.07 | 377.1825 | 499.91 |
| Rating | 1000.0 | 2.98440 | 1.137132 | 1.00 | 2.000 | 2.90 | 3.9000 | 5.00 |
| Number of Reviews | 1000.0 | 480.55900 | 280.554782 | 1.00 | 242.500 | 465.00 | 703.2500 | 998.00 |
| Transaction No. | 1000.0 | 500.50000 | 288.819436 | 1.00 | 250.750 | 500.50 | 750.2500 | 1000.00 |
# Summary statistics for categorical variables
df.describe(include='object').T
| | count | unique | top | freq |
|---|---|---|---|---|
| UserID | 1000 | 20 | User8 | 63 |
| Gender | 1000 | 2 | Male | 530 |
| Location | 1000 | 18 | Greece | 74 |
| Item Name | 1000 | 10 | T-Shirt | 124 |
| Brand | 1000 | 969 | Smith Inc | 3 |
| Transaction Date | 1000 | 548 | 2022-10-23 00:00:00 | 6 |
df.isnull().sum()
UserID               0
Age                  0
Gender               0
Location             0
Item Name            0
Brand                0
Price                0
Rating               0
Number of Reviews    0
Purchase History     0
Search History       0
Wishlist             0
Cart Items           0
Click History        0
Transaction No.      0
Transaction Date     0
dtype: int64
No missing values in any column.
# Finding duplicate rows
duplicate_rows = df[df.duplicated(keep=False)]
# Displaying the duplicate rows
duplicate_rows.head()
| UserID | Age | Gender | Location | Item Name | Brand | Price | Rating | Number of Reviews | Purchase History | Search History | Wishlist | Cart Items | Click History | Transaction No. | Transaction Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
# Displaying the number of duplicate rows
print(f"The dataset contains {df.duplicated().sum()} duplicate rows that need to be removed.")
The dataset contains 0 duplicate rows that need to be removed.
df
| | UserID | Age | Gender | Location | Item Name | Brand | Price | Rating | Number of Reviews | Purchase History | Search History | Wishlist | Cart Items | Click History | Transaction No. | Transaction Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User14 | 40 | Female | Russia | Watch | Henry-Johnson | 205.90 | 2.8 | 624 | True | False | False | False | False | 1 | 2023-07-08 00:00:00 |
| 1 | User8 | 50 | Male | Russia | T-Shirt | Lawson-Tucker | 132.02 | 3.1 | 794 | False | False | False | True | True | 2 | 2023-03-14 00:00:00 |
| 2 | User2 | 59 | Male | India | Headphones | Gomez LLC | 384.34 | 1.4 | 619 | False | False | False | False | False | 3 | 2022-10-17 00:00:00 |
| 3 | User6 | 48 | Female | Argentina | T-Shirt | Morris-Miller | 341.77 | 3.0 | 336 | True | True | True | False | True | 4 | 2022-04-02 00:00:00 |
| 4 | User5 | 49 | Female | Singapore | Bag | Waters-Hall | 309.58 | 3.6 | 973 | True | True | True | False | False | 5 | 2022-01-29 00:00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | User3 | 25 | Male | Belgium | Watch | Jimenez, Noble and Nicholson | 372.24 | 4.7 | 320 | False | True | True | False | False | 996 | 2022-11-02 00:00:00 |
| 996 | User19 | 42 | Male | UK | Shoes | Jenkins-Williams | 27.06 | 4.1 | 239 | False | True | False | False | False | 997 | 2023-09-08 00:00:00 |
| 997 | User7 | 42 | Female | Singapore | T-Shirt | Leonard, Phillips and Scott | 148.12 | 3.5 | 449 | False | False | False | True | False | 998 | 2023-01-11 00:00:00 |
| 998 | User11 | 56 | Male | Japan | Perfume | Coleman-Savage | 481.67 | 3.6 | 419 | False | True | True | True | True | 999 | 2022-07-02 00:00:00 |
| 999 | User1 | 52 | Female | India | Bag | Wright-Richards | 427.96 | 1.9 | 25 | True | False | False | False | True | 1000 | 2023-01-20 00:00:00 |
1000 rows × 16 columns
Looking at the dataset above, there is an inconsistency: rows where Purchase History is False still carry a Transaction No., even though no purchase was actually made.
So I will now assign 0 to every Transaction No. whose corresponding Purchase History is False.
# Convert 'Purchase History' column to boolean values
df['Purchase History'] = df['Purchase History'].astype(bool)
# Set 'Transaction No.' to 0 wherever 'Purchase History' is False
df['Transaction No.'] = df['Transaction No.'].where(df['Purchase History'], 0)
df
| | UserID | Age | Gender | Location | Item Name | Brand | Price | Rating | Number of Reviews | Purchase History | Search History | Wishlist | Cart Items | Click History | Transaction No. | Transaction Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User14 | 40 | Female | Russia | Watch | Henry-Johnson | 205.90 | 2.8 | 624 | True | False | False | False | False | 1 | 2023-07-08 00:00:00 |
| 1 | User8 | 50 | Male | Russia | T-Shirt | Lawson-Tucker | 132.02 | 3.1 | 794 | False | False | False | True | True | 0 | 2023-03-14 00:00:00 |
| 2 | User2 | 59 | Male | India | Headphones | Gomez LLC | 384.34 | 1.4 | 619 | False | False | False | False | False | 0 | 2022-10-17 00:00:00 |
| 3 | User6 | 48 | Female | Argentina | T-Shirt | Morris-Miller | 341.77 | 3.0 | 336 | True | True | True | False | True | 4 | 2022-04-02 00:00:00 |
| 4 | User5 | 49 | Female | Singapore | Bag | Waters-Hall | 309.58 | 3.6 | 973 | True | True | True | False | False | 5 | 2022-01-29 00:00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | User3 | 25 | Male | Belgium | Watch | Jimenez, Noble and Nicholson | 372.24 | 4.7 | 320 | False | True | True | False | False | 0 | 2022-11-02 00:00:00 |
| 996 | User19 | 42 | Male | UK | Shoes | Jenkins-Williams | 27.06 | 4.1 | 239 | False | True | False | False | False | 0 | 2023-09-08 00:00:00 |
| 997 | User7 | 42 | Female | Singapore | T-Shirt | Leonard, Phillips and Scott | 148.12 | 3.5 | 449 | False | False | False | True | False | 0 | 2023-01-11 00:00:00 |
| 998 | User11 | 56 | Male | Japan | Perfume | Coleman-Savage | 481.67 | 3.6 | 419 | False | True | True | True | True | 0 | 2022-07-02 00:00:00 |
| 999 | User1 | 52 | Female | India | Bag | Wright-Richards | 427.96 | 1.9 | 25 | True | False | False | False | True | 1000 | 2023-01-20 00:00:00 |
1000 rows × 16 columns
Descriptive Statistics
# Numerical columns
numerical_cols = ['Age', 'Price', 'Rating', 'Number of Reviews']
descriptive_stats_numerical = df[numerical_cols].describe()
descriptive_stats_numerical
| | Age | Price | Rating | Number of Reviews |
|---|---|---|---|---|
| count | 1000.00000 | 1000.00000 | 1000.000000 | 1000.000000 |
| mean | 41.58200 | 254.49653 | 2.984400 | 480.559000 |
| std | 13.82948 | 140.60656 | 1.137132 | 280.554782 |
| min | 18.00000 | 10.16000 | 1.000000 | 1.000000 |
| 25% | 30.00000 | 138.13500 | 2.000000 | 242.500000 |
| 50% | 42.00000 | 248.07000 | 2.900000 | 465.000000 |
| 75% | 54.00000 | 377.18250 | 3.900000 | 703.250000 |
| max | 65.00000 | 499.91000 | 5.000000 | 998.000000 |
Data Visualization
# Histograms for numerical variables
for col in numerical_cols:
plt.figure(figsize=(8, 6))
plt.hist(df[col], bins=10, edgecolor='black')
plt.title(f'Histogram of {col}')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.show()
Bin width = (max value - min value) / number of bins
Bin width for Age = (65 - 18) / 10 = 4.7 (rounded to 5-year-wide bins below for readability; matplotlib itself uses the exact 4.7-wide bins)
Bin 1: 18 - 22
Bin 2: 23 - 27
Bin 3: 28 - 32
Bin 4: 33 - 37
Bin 5: 38 - 42
Bin 6: 43 - 47
Bin 7: 48 - 52
Bin 8: 53 - 57
Bin 9: 58 - 62
Bin 10: 63 - 67
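These edges and counts can also be computed directly rather than by hand; a quick sketch with NumPy:
edges = np.histogram_bin_edges(df['Age'], bins=10)  # 11 edges, each bin (65 - 18) / 10 = 4.7 wide
counts, _ = np.histogram(df['Age'], bins=10)        # frequency per bin, as in the histogram above
print(edges)
print(counts)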
bin1 = df[df['Age'] <= 22]['UserID'].count()
print("Number of UserID with Age below or equal to 22:", bin1)
Number of UserID with Age below or equal to 22: 101
bin2 = df[(df['Age'] >= 23) & (df['Age'] <= 27)]['UserID'].count()
print("Number of UserID with Age more or equal to 23 to less or equal to 27:", bin2)
Number of UserID with Age more or equal to 23 to less or equal to 27: 98
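Counting every bin by hand gets repetitive; pd.cut can bucket all ages in one pass. A sketch using the 5-year-wide, left-closed bins listed above:
age_groups = pd.cut(df['Age'], bins=range(18, 69, 5), right=False)  # [18, 23), [23, 28), ..., [63, 68)
print(age_groups.value_counts().sort_index())                       # the first bin should match bin1 above (101)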
Descriptive Statistics
# Categorical columns
categorical_cols = ['Gender', 'Location', 'Item Name', 'Brand', 'Purchase History', 'Search History', 'Wishlist', 'Cart Items', 'Click History']
descriptive_stats_categorical = df[categorical_cols].describe()
descriptive_stats_categorical
| | Gender | Location | Item Name | Brand | Purchase History | Search History | Wishlist | Cart Items | Click History |
|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| unique | 2 | 18 | 10 | 969 | 2 | 2 | 2 | 2 | 2 |
| top | Male | Greece | T-Shirt | Smith Inc | False | True | True | True | True |
| freq | 530 | 74 | 124 | 3 | 501 | 515 | 506 | 522 | 514 |
# Count the frequency of unique values in categorical columns
frequency_categorical = {col: df[col].value_counts() for col in categorical_cols}
frequency_categorical
{'Gender': Male 530
Female 470
Name: Gender, dtype: int64,
'Location': Greece 74
Russia 71
India 65
New Zealand 62
Japan 61
USA 59
Brazil 58
UK 55
Singapore 55
France 54
Argentina 53
Belgium 52
Switzerland 50
Italy 49
Australia 49
Canada 46
Spain 44
UAE 43
Name: Location, dtype: int64,
'Item Name': T-Shirt 124
Smartphone 115
Shoes 109
Headphones 104
Laptop 99
Bag 96
Jeans 93
Perfume 93
Sunglasses 86
Watch 81
Name: Item Name, dtype: int64,
'Brand': Smith Inc 3
Martin Group 3
Hernandez Ltd 3
Martin and Sons 3
Smith Group 3
..
Anderson, Mcpherson and Miller 1
Johnson, Murphy and Ruiz 1
Ryan, Hudson and Best 1
Osborne, Cooper and Green 1
Wright-Richards 1
Name: Brand, Length: 969, dtype: int64,
'Purchase History': False 501
True 499
Name: Purchase History, dtype: int64,
'Search History': True 515
False 485
Name: Search History, dtype: int64,
'Wishlist': True 506
False 494
Name: Wishlist, dtype: int64,
'Cart Items': True 522
False 478
Name: Cart Items, dtype: int64,
'Click History': True 514
False 486
Name: Click History, dtype: int64}
Data Visualization
# Bar plots and count plots for categorical variables
for col in categorical_cols:
plt.figure(figsize=(10, 6))
if len(df[col].unique()) <= 10: # For variables with fewer unique values, use countplot
sns.countplot(data=df, x=col)
plt.title(f'Count Plot of {col}')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.xticks(rotation=45)
else: # For variables with more than 10 unique values, use bar plot
df[col].value_counts().plot(kind='bar')
plt.title(f'Bar Plot of {col}')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
df
| | UserID | Age | Gender | Location | Item Name | Brand | Price | Rating | Number of Reviews | Purchase History | Search History | Wishlist | Cart Items | Click History | Transaction No. | Transaction Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User14 | 40 | Female | Russia | Watch | Henry-Johnson | 205.90 | 2.8 | 624 | True | False | False | False | False | 1 | 2023-07-08 00:00:00 |
| 1 | User8 | 50 | Male | Russia | T-Shirt | Lawson-Tucker | 132.02 | 3.1 | 794 | False | False | False | True | True | 0 | 2023-03-14 00:00:00 |
| 2 | User2 | 59 | Male | India | Headphones | Gomez LLC | 384.34 | 1.4 | 619 | False | False | False | False | False | 0 | 2022-10-17 00:00:00 |
| 3 | User6 | 48 | Female | Argentina | T-Shirt | Morris-Miller | 341.77 | 3.0 | 336 | True | True | True | False | True | 4 | 2022-04-02 00:00:00 |
| 4 | User5 | 49 | Female | Singapore | Bag | Waters-Hall | 309.58 | 3.6 | 973 | True | True | True | False | False | 5 | 2022-01-29 00:00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | User3 | 25 | Male | Belgium | Watch | Jimenez, Noble and Nicholson | 372.24 | 4.7 | 320 | False | True | True | False | False | 0 | 2022-11-02 00:00:00 |
| 996 | User19 | 42 | Male | UK | Shoes | Jenkins-Williams | 27.06 | 4.1 | 239 | False | True | False | False | False | 0 | 2023-09-08 00:00:00 |
| 997 | User7 | 42 | Female | Singapore | T-Shirt | Leonard, Phillips and Scott | 148.12 | 3.5 | 449 | False | False | False | True | False | 0 | 2023-01-11 00:00:00 |
| 998 | User11 | 56 | Male | Japan | Perfume | Coleman-Savage | 481.67 | 3.6 | 419 | False | True | True | True | True | 0 | 2022-07-02 00:00:00 |
| 999 | User1 | 52 | Female | India | Bag | Wright-Richards | 427.96 | 1.9 | 25 | True | False | False | False | True | 1000 | 2023-01-20 00:00:00 |
1000 rows × 16 columns
Feature engineering helps us build a customer-centric dataset for K-Means clustering.
RFM Features:
RFM is a method used for analyzing customer value and segmenting the customer base.
Recency (R)
Recency (R): This metric denotes the duration since the customer's last purchase. A lower figure suggests recent purchases, indicating higher engagement with the business. Conversely, a higher value may imply decreased engagement or a lapse. By gauging purchase recency, businesses can customize marketing approaches to re-engage customers who haven't made recent purchases, potentially enhancing retention and fostering loyalty.
# Convert Transaction Date to datetime type
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])
# Convert Transaction Date to datetime and extract only the date
df['Transaction Day'] = df['Transaction Date'].dt.date
# Find the most recent purchase date for each customer
customer_engagement = df.groupby('UserID')['Transaction Day'].max().reset_index()
# Find the most recent date in the entire dataset
most_recent_date = df['Transaction Day'].max()
# Convert Transaction Day to datetime type before subtraction
customer_engagement['Transaction Day'] = pd.to_datetime(customer_engagement['Transaction Day'])
most_recent_date = pd.to_datetime(most_recent_date)
# Calculate the number of days since the last purchase for each customer
customer_engagement['Days_Since_Last_Purchase'] = (most_recent_date - customer_engagement['Transaction Day']).dt.days
# Remove the Transaction Day column
customer_engagement.drop(columns=['Transaction Day'], inplace=True)
customer_engagement
| | UserID | Days_Since_Last_Purchase |
|---|---|---|
| 0 | User1 | 8 |
| 1 | User10 | 18 |
| 2 | User11 | 8 |
| 3 | User12 | 2 |
| 4 | User13 | 27 |
| 5 | User14 | 13 |
| 6 | User15 | 2 |
| 7 | User16 | 5 |
| 8 | User17 | 3 |
| 9 | User18 | 9 |
| 10 | User19 | 0 |
| 11 | User2 | 31 |
| 12 | User20 | 9 |
| 13 | User3 | 1 |
| 14 | User4 | 19 |
| 15 | User5 | 26 |
| 16 | User6 | 1 |
| 17 | User7 | 2 |
| 18 | User8 | 7 |
| 19 | User9 | 11 |
Frequency (F)
These metrics help evaluate how actively a customer engages with the business.
Total Transactions: This metric reflects the overall number of transactions completed by a customer.
Total Products Purchased: This metric denotes the cumulative quantity of products bought by a customer.
Total Products Searched: This metric denotes the cumulative quantity of products searched by a customer.
Total Products in Wishlist: This metric denotes the cumulative quantity of products wishlisted by a customer.
Total Products in Cart: This metric denotes the cumulative quantity of products in cart by a customer.
Total Products in Click History: This metric denotes the cumulative quantity of products in click history by a customer.
A higher frequency value indicates greater customer loyalty or satisfaction; a lower value suggests the opposite.
# Keep only rows where 'Transaction No.' is not 0
filtered_df = df[df['Transaction No.'] != 0]
# Find total transactions for each customer
total_transactions = filtered_df.groupby('UserID')['Transaction No.'].nunique()
# Rename the column to 'Total Transactions'
total_transactions = total_transactions.reset_index().rename(columns={'Transaction No.': 'Total Transactions'})
# Merge the new features into the customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, total_transactions, on='UserID')
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions |
|---|---|---|---|
| 0 | User1 | 8 | 20 |
| 1 | User10 | 18 | 29 |
| 2 | User11 | 8 | 26 |
| 3 | User12 | 2 | 17 |
| 4 | User13 | 27 | 35 |
| 5 | User14 | 13 | 31 |
| 6 | User15 | 2 | 30 |
| 7 | User16 | 5 | 21 |
| 8 | User17 | 3 | 23 |
| 9 | User18 | 9 | 19 |
| 10 | User19 | 0 | 22 |
| 11 | User2 | 31 | 26 |
| 12 | User20 | 9 | 30 |
| 13 | User3 | 1 | 30 |
| 14 | User4 | 19 | 24 |
| 15 | User5 | 26 | 23 |
| 16 | User6 | 1 | 23 |
| 17 | User7 | 2 | 22 |
| 18 | User8 | 7 | 26 |
| 19 | User9 | 11 | 22 |
# Assuming Purchase History column contains boolean values
# Convert True/False to 1/0
df['Purchase History'] = df['Purchase History'].astype(int)
# Group by UserID and sum the Purchase History column
Total_Products_Purchased = df.groupby('UserID')['Purchase History'].sum().reset_index()
# Rename the column to Total Products Purchased
Total_Products_Purchased.rename(columns={'Purchase History': 'Total Products Purchased'}, inplace=True)
# Merge the new features into the customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, Total_Products_Purchased, on='UserID')
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased |
|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 |
| 1 | User10 | 18 | 29 | 29 |
| 2 | User11 | 8 | 26 | 26 |
| 3 | User12 | 2 | 17 | 17 |
| 4 | User13 | 27 | 35 | 35 |
| 5 | User14 | 13 | 31 | 31 |
| 6 | User15 | 2 | 30 | 30 |
| 7 | User16 | 5 | 21 | 21 |
| 8 | User17 | 3 | 23 | 23 |
| 9 | User18 | 9 | 19 | 19 |
| 10 | User19 | 0 | 22 | 22 |
| 11 | User2 | 31 | 26 | 26 |
| 12 | User20 | 9 | 30 | 30 |
| 13 | User3 | 1 | 30 | 30 |
| 14 | User4 | 19 | 24 | 24 |
| 15 | User5 | 26 | 23 | 23 |
| 16 | User6 | 1 | 23 | 23 |
| 17 | User7 | 2 | 22 | 22 |
| 18 | User8 | 7 | 26 | 26 |
| 19 | User9 | 11 | 22 | 22 |
# Assuming Search History column contains boolean values
# Convert True/False to 1/0
df['Search History'] = df['Search History'].astype(int)
# Group by UserID and sum the Search History column
Total_Products_Searched = df.groupby('UserID')['Search History'].sum().reset_index()
# Rename the column to Total Products Searched
Total_Products_Searched.rename(columns={'Search History': 'Total Products Searched'}, inplace=True)
# Merge the new features into the customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, Total_Products_Searched, on='UserID')
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched |
|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 |
| 1 | User10 | 18 | 29 | 29 | 27 |
| 2 | User11 | 8 | 26 | 26 | 25 |
| 3 | User12 | 2 | 17 | 17 | 19 |
| 4 | User13 | 27 | 35 | 35 | 33 |
| 5 | User14 | 13 | 31 | 31 | 21 |
| 6 | User15 | 2 | 30 | 30 | 28 |
| 7 | User16 | 5 | 21 | 21 | 23 |
| 8 | User17 | 3 | 23 | 23 | 26 |
| 9 | User18 | 9 | 19 | 19 | 24 |
| 10 | User19 | 0 | 22 | 22 | 35 |
| 11 | User2 | 31 | 26 | 26 | 31 |
| 12 | User20 | 9 | 30 | 30 | 22 |
| 13 | User3 | 1 | 30 | 30 | 30 |
| 14 | User4 | 19 | 24 | 24 | 20 |
| 15 | User5 | 26 | 23 | 23 | 20 |
| 16 | User6 | 1 | 23 | 23 | 22 |
| 17 | User7 | 2 | 22 | 22 | 18 |
| 18 | User8 | 7 | 26 | 26 | 36 |
| 19 | User9 | 11 | 22 | 22 | 30 |
# Assuming Wishlist column contains boolean values
# Convert True/False to 1/0
df['Wishlist'] = df['Wishlist'].astype(int)
# Group by UserID and sum the Wishlist column
Total_Products_in_Wishlist = df.groupby('UserID')['Wishlist'].sum().reset_index()
# Rename the column to Total Products in Wishlist
Total_Products_in_Wishlist.rename(columns={'Wishlist': 'Total Products in Wishlist'}, inplace=True)
# Merge the new features into the customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, Total_Products_in_Wishlist, on='UserID')
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist |
|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 |
# Assuming Cart Items column contains boolean values
# Convert True/False to 1/0
df['Cart Items'] = df['Cart Items'].astype(int)
# Group by UserID and sum the Cart Items column
Total_Products_in_Cart = df.groupby('UserID')['Cart Items'].sum().reset_index()
# Rename the column to Total Products in Cart
Total_Products_in_Cart.rename(columns={'Cart Items': 'Total Products in Cart'}, inplace=True)
# Merge the new features into the customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, Total_Products_in_Cart, on='UserID')
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart |
|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 | 38 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 | 24 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 | 29 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 | 25 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 | 23 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 | 27 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 | 31 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 | 22 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 | 23 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 | 24 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 | 20 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 | 24 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 | 26 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 | 19 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 | 33 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 | 30 |
# Assuming Click History column contains boolean values
# Convert True/False to 1/0
df['Click History'] = df['Click History'].astype(int)
# Group by UserID and sum the Click History column
Total_Products_in_Click_History = df.groupby('UserID')['Click History'].sum().reset_index()
# Rename the column to Total Products in Click History
Total_Products_in_Click_History.rename(columns={'Click History': 'Total Products in Click History'}, inplace=True)
# Merge the new features into the customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, Total_Products_in_Click_History, on='UserID')
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History |
|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 | 38 | 40 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 |
Monetary (M)
Total Spend: This metric reflects the total amount of money spent by a customer.
Willing to Spend: This metric reflects the total value of the products a customer intends to buy, i.e. items in the wishlist or cart.
Average Transaction Value: Total Spend / Total Transactions
Customers with a higher monetary value have contributed more to the business, indicating a potentially high lifetime value; a lower value suggests the opposite.
# Calculating total spend for each customer
df['Total Spend'] = df['Price'] * df['Purchase History']
# Grouping by UserID and summing up total spend
total_spend = df.groupby('UserID')['Total Spend'].sum()
# Merge the new features into the customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, total_spend, on='UserID')
# Drop the 'Total Spend' column from the original dataframe 'df'
df.drop('Total Spend', axis=1, inplace=True)
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend |
|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 | 5776.59 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 | 7441.41 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 | 5301.64 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 | 4673.99 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 | 38 | 40 | 8104.59 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 | 9396.11 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 | 8396.60 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 | 4820.72 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 | 6739.00 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 | 5151.60 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 | 5186.49 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 | 6909.48 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 | 7695.27 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 | 8103.76 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 | 5797.75 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 | 5589.11 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 | 5605.75 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 | 5526.25 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 | 6121.69 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 | 4945.00 |
# Calculating total amount willing to spend for each customer
df['Willing to Spend'] = df['Price'] * (df['Wishlist'] | df['Cart Items'])
# Grouping by UserID and summing up the willing-to-spend amounts
total_willing_to_spend = df.groupby('UserID')['Willing to Spend'].sum()
# Merge the new features into the customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, total_willing_to_spend, on='UserID')
# Drop the 'Willing to Spend' column from the original dataframe 'df'
df.drop('Willing to Spend', axis=1, inplace=True)
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend | Willing to Spend |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 | 5776.59 | 11456.50 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 | 7441.41 | 10486.52 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 | 5301.64 | 10131.83 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 | 4673.99 | 7863.82 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 | 38 | 40 | 8104.59 | 13273.99 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 | 9396.11 | 9997.53 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 | 8396.60 | 11807.39 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 | 4820.72 | 9924.32 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 | 6739.00 | 8066.76 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 | 5151.60 | 9230.63 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 | 5186.49 | 11015.50 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 | 6909.48 | 9792.30 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 | 7695.27 | 9229.10 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 | 8103.76 | 9434.37 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 | 5797.75 | 8694.34 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 | 5589.11 | 9260.48 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 | 5605.75 | 9366.50 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 | 5526.25 | 7725.81 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 | 6121.69 | 11976.82 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 | 4945.00 | 10629.51 |
# Calculate average transaction value
customer_engagement['Average Transaction Value'] = customer_engagement['Total Spend'] / customer_engagement['Total Transactions']
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend | Willing to Spend | Average Transaction Value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 | 5776.59 | 11456.50 | 288.829500 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 | 7441.41 | 10486.52 | 256.600345 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 | 5301.64 | 10131.83 | 203.909231 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 | 4673.99 | 7863.82 | 274.940588 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 | 38 | 40 | 8104.59 | 13273.99 | 231.559714 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 | 9396.11 | 9997.53 | 303.100323 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 | 8396.60 | 11807.39 | 279.886667 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 | 4820.72 | 9924.32 | 229.558095 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 | 6739.00 | 8066.76 | 293.000000 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 | 5151.60 | 9230.63 | 271.136842 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 | 5186.49 | 11015.50 | 235.749545 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 | 6909.48 | 9792.30 | 265.749231 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 | 7695.27 | 9229.10 | 256.509000 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 | 8103.76 | 9434.37 | 270.125333 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 | 5797.75 | 8694.34 | 241.572917 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 | 5589.11 | 9260.48 | 243.004783 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 | 5605.75 | 9366.50 | 243.728261 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 | 5526.25 | 7725.81 | 251.193182 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 | 6121.69 | 11976.82 | 235.449615 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 | 4945.00 | 10629.51 | 224.772727 |
Product Diversity:
These features capture the diversity of customers' purchasing behavior.
Unique Products Purchased: This metric counts the distinct products a customer has purchased.
Unique Products to be Purchased: This metric counts the distinct products a customer intends to purchase (wishlisted or in the cart).
A higher figure suggests diverse tastes and preferences spanning a broad array of products; a lower figure suggests the opposite.
# Filter rows with Purchase History as 1 (True)
df_filtered = df[df['Purchase History'] == 1]
# Count unique products purchased by each customer
unique_products_purchased = df_filtered.groupby('UserID')['Item Name'].nunique()
# Rename the resulting Series to Unique Products Purchased
unique_products_purchased = unique_products_purchased.rename('Unique Products Purchased')
# Merge the new features into the customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, unique_products_purchased, on='UserID')
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend | Willing to Spend | Average Transaction Value | Unique Products Purchased |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 | 5776.59 | 11456.50 | 288.829500 | 10 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 | 7441.41 | 10486.52 | 256.600345 | 10 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 | 5301.64 | 10131.83 | 203.909231 | 9 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 | 4673.99 | 7863.82 | 274.940588 | 6 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 | 38 | 40 | 8104.59 | 13273.99 | 231.559714 | 10 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 | 9396.11 | 9997.53 | 303.100323 | 9 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 | 8396.60 | 11807.39 | 279.886667 | 10 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 | 4820.72 | 9924.32 | 229.558095 | 9 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 | 6739.00 | 8066.76 | 293.000000 | 8 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 | 5151.60 | 9230.63 | 271.136842 | 8 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 | 5186.49 | 11015.50 | 235.749545 | 10 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 | 6909.48 | 9792.30 | 265.749231 | 8 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 | 7695.27 | 9229.10 | 256.509000 | 10 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 | 8103.76 | 9434.37 | 270.125333 | 10 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 | 5797.75 | 8694.34 | 241.572917 | 9 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 | 5589.11 | 9260.48 | 243.004783 | 10 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 | 5605.75 | 9366.50 | 243.728261 | 9 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 | 5526.25 | 7725.81 | 251.193182 | 9 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 | 6121.69 | 11976.82 | 235.449615 | 10 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 | 4945.00 | 10629.51 | 224.772727 | 8 |
# Combine Wishlist and Cart Items into one column
df['Willing to Purchase'] = df['Wishlist'] | df['Cart Items']
# Filter rows where the customer is willing to purchase
df_willing_to_purchase = df[df['Willing to Purchase'] == 1]
# Count unique products the customer is willing to purchase
unique_products_willing_to_purchase = df_willing_to_purchase.groupby('UserID')['Item Name'].nunique()
# Rename the resulting Series to Unique Products to be Purchased
unique_products_willing_to_purchase = unique_products_willing_to_purchase.rename('Unique Products to be Purchased')
# Merge the new features into the customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, unique_products_willing_to_purchase, on='UserID')
# Drop the 'Willing to Purchase' column from the original dataframe 'df'
df.drop('Willing to Purchase', axis=1, inplace=True)
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend | Willing to Spend | Average Transaction Value | Unique Products Purchased | Unique Products to be Purchased |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 | 5776.59 | 11456.50 | 288.829500 | 10 | 10 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 | 7441.41 | 10486.52 | 256.600345 | 10 | 10 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 | 5301.64 | 10131.83 | 203.909231 | 9 | 10 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 | 4673.99 | 7863.82 | 274.940588 | 6 | 10 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 | 38 | 40 | 8104.59 | 13273.99 | 231.559714 | 10 | 10 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 | 9396.11 | 9997.53 | 303.100323 | 9 | 10 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 | 8396.60 | 11807.39 | 279.886667 | 10 | 10 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 | 4820.72 | 9924.32 | 229.558095 | 9 | 10 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 | 6739.00 | 8066.76 | 293.000000 | 8 | 9 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 | 5151.60 | 9230.63 | 271.136842 | 8 | 10 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 | 5186.49 | 11015.50 | 235.749545 | 10 | 10 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 | 6909.48 | 9792.30 | 265.749231 | 8 | 10 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 | 7695.27 | 9229.10 | 256.509000 | 10 | 10 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 | 8103.76 | 9434.37 | 270.125333 | 10 | 10 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 | 5797.75 | 8694.34 | 241.572917 | 9 | 9 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 | 5589.11 | 9260.48 | 243.004783 | 10 | 10 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 | 5605.75 | 9366.50 | 243.728261 | 9 | 10 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 | 5526.25 | 7725.81 | 251.193182 | 9 | 9 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 | 6121.69 | 11976.82 | 235.449615 | 10 | 10 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 | 4945.00 | 10629.51 | 224.772727 | 8 | 10 |
Demographic Features, Product based features & Geographic Features:
Demographic Features: This involves segmenting customers based on demographic characteristics such as age and gender. It helps in understanding the different segments of the customer base and in tailoring marketing messages and product offerings to their specific needs and preferences.
Product-based Features: This involves segmenting customers based on the products they are interested in or have purchased. This helps in understanding which products are popular among different customer segments and how to position them in the marketing efforts.
Geographic Features: This reflects the geographical location of customers. Different regions might have varying preferences and purchasing behaviors which can be critical in personalizing marketing strategies and inventory planning.
df['Age'].value_counts(normalize=True)
47    0.031
55    0.030
43    0.028
65    0.028
61    0.027
28    0.027
40    0.026
30    0.025
18    0.025
33    0.025
21    0.024
46    0.024
53    0.024
49    0.023
25    0.023
39    0.022
42    0.022
34    0.022
56    0.021
58    0.021
29    0.021
54    0.021
19    0.021
48    0.020
27    0.020
62    0.020
37    0.020
38    0.020
23    0.019
41    0.019
24    0.019
60    0.019
52    0.019
31    0.018
51    0.018
63    0.018
32    0.018
35    0.018
36    0.018
50    0.017
26    0.017
22    0.016
64    0.016
57    0.016
20    0.015
59    0.015
44    0.014
45    0.010
Name: Age, dtype: float64
df['Gender'].value_counts(normalize=True)
Male      0.53
Female    0.47
Name: Gender, dtype: float64
df['Item Name'].value_counts(normalize=True)
T-Shirt       0.124
Smartphone    0.115
Shoes         0.109
Headphones    0.104
Laptop        0.099
Bag           0.096
Jeans         0.093
Perfume       0.093
Sunglasses    0.086
Watch         0.081
Name: Item Name, dtype: float64
df['Location'].value_counts(normalize=True)
Greece         0.074
Russia         0.071
India          0.065
New Zealand    0.062
Japan          0.061
USA            0.059
Brazil         0.058
UK             0.055
Singapore      0.055
France         0.054
Argentina      0.053
Belgium        0.052
Switzerland    0.050
Italy          0.049
Australia      0.049
Canada         0.046
Spain          0.044
UAE            0.043
Name: Location, dtype: float64
# Demographic Features
# Modal (most common) age in the dataset
majority_age = df['Age'].mode()[0]
# Flag 1 if Age is at or above the modal age, else 0
customer_engagement['Same_Age_or_Above'] = df['Age'].apply(lambda x: 1 if x >= majority_age else 0)
# Flag 1 for Male customers, else 0
customer_engagement['Is_Male'] = df['Gender'].apply(lambda x: 1 if x == 'Male' else 0)
# Product-based Features
# Modal (most frequently bought) item
majority_item = df['Item Name'].mode()[0]
# Flag 1 if the row's item is the modal item, else 0
customer_engagement['Buying_Same_Item'] = df['Item Name'].apply(lambda x: 1 if x == majority_item else 0)
# Geographic Features
# Modal (most common) location
majority_location = df['Location'].mode()[0]
# Flag 1 if the row's location is the modal location, else 0
customer_engagement['From_Same_Location'] = df['Location'].apply(lambda x: 1 if x == majority_location else 0)
majority_age
47
majority_item
'T-Shirt'
majority_location
'Greece'
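A caveat on the four flags above: they are computed from df (1,000 transaction rows) but assigned into customer_engagement (20 customer rows), so pandas index alignment silently keeps only the values from df's first 20 rows; the flags describe individual transactions rather than customers. A more robust per-customer construction would aggregate by UserID first. A sketch of that alternative (hypothetical, not what produced the table below):
# Hypothetical per-customer versions of the same flags, aggregated over each
# user's transactions before joining back into customer_engagement
demo_flags = df.groupby('UserID').agg(
    Same_Age_or_Above=('Age', lambda s: int(s.iloc[0] >= majority_age)),
    Is_Male=('Gender', lambda s: int(s.iloc[0] == 'Male')),
    Buying_Same_Item=('Item Name', lambda s: int((s == majority_item).any())),
    From_Same_Location=('Location', lambda s: int((s == majority_location).any())),
).reset_index()
# customer_engagement = customer_engagement.drop(columns=demo_flags.columns[1:]).merge(demo_flags, on='UserID')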
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend | Willing to Spend | Average Transaction Value | Unique Products Purchased | Unique Products to be Purchased | Same_Age_or_Above | Is_Male | Buying_Same_Item | From_Same_Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 | 5776.59 | 11456.50 | 288.829500 | 10 | 10 | 0 | 0 | 0 | 0 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 | 7441.41 | 10486.52 | 256.600345 | 10 | 10 | 1 | 1 | 1 | 0 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 | 5301.64 | 10131.83 | 203.909231 | 9 | 10 | 1 | 1 | 0 | 0 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 | 4673.99 | 7863.82 | 274.940588 | 6 | 10 | 1 | 0 | 1 | 0 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 | 38 | 40 | 8104.59 | 13273.99 | 231.559714 | 10 | 10 | 1 | 0 | 0 | 0 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 | 9396.11 | 9997.53 | 303.100323 | 9 | 10 | 1 | 0 | 1 | 0 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 | 8396.60 | 11807.39 | 279.886667 | 10 | 10 | 0 | 0 | 0 | 0 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 | 4820.72 | 9924.32 | 229.558095 | 9 | 10 | 0 | 0 | 0 | 0 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 | 6739.00 | 8066.76 | 293.000000 | 8 | 9 | 0 | 0 | 0 | 1 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 | 5151.60 | 9230.63 | 271.136842 | 8 | 10 | 0 | 1 | 1 | 0 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 | 5186.49 | 11015.50 | 235.749545 | 10 | 10 | 0 | 0 | 1 | 0 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 | 6909.48 | 9792.30 | 265.749231 | 8 | 10 | 0 | 0 | 0 | 0 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 | 7695.27 | 9229.10 | 256.509000 | 10 | 10 | 0 | 1 | 0 | 0 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 | 8103.76 | 9434.37 | 270.125333 | 10 | 10 | 0 | 0 | 0 | 0 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 | 5797.75 | 8694.34 | 241.572917 | 9 | 9 | 1 | 1 | 0 | 0 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 | 5589.11 | 9260.48 | 243.004783 | 10 | 10 | 0 | 0 | 0 | 0 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 | 5605.75 | 9366.50 | 243.728261 | 9 | 10 | 0 | 0 | 0 | 0 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 | 5526.25 | 7725.81 | 251.193182 | 9 | 9 | 0 | 0 | 0 | 0 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 | 6121.69 | 11976.82 | 235.449615 | 10 | 10 | 0 | 0 | 0 | 0 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 | 4945.00 | 10629.51 | 224.772727 | 8 | 10 | 0 | 1 | 1 | 0 |
Behavioral Features:
Our objective is to capture customers' shopping patterns and behaviors, providing insight into when they prefer to shop.
Days Between Purchases: This metric indicates the average duration in days that a customer waits before making another purchase. This understanding can aid in predicting when customers are likely to make their next purchase, which is vital for targeted marketing strategies and personalized promotions.
# Ensure 'Transaction Date' is in datetime format
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])
# Group transactions by UserID and calculate the time difference between consecutive purchases
df['Days_Between_Purchases'] = df.groupby('UserID')['Transaction Date'].diff().dt.days
# Calculate average days between purchases for each user
average_days_between_purchases = df.groupby('UserID')['Days_Between_Purchases'].mean().reset_index()
# Merge the result with customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, average_days_between_purchases, on='UserID')
# Drop the 'Days_Between_Purchases' column from the original dataframe 'df'
df.drop('Days_Between_Purchases', axis=1, inplace=True)
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend | Willing to Spend | Average Transaction Value | Unique Products Purchased | Unique Products to be Purchased | Same_Age_or_Above | Is_Male | Buying_Same_Item | From_Same_Location | Days_Between_Purchases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 | 5776.59 | 11456.50 | 288.829500 | 10 | 10 | 0 | 0 | 0 | 0 | 4.448980 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 | 7441.41 | 10486.52 | 256.600345 | 10 | 10 | 1 | 1 | 1 | 0 | 8.655172 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 | 5301.64 | 10131.83 | 203.909231 | 9 | 10 | 1 | 1 | 0 | 0 | -0.980000 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 | 4673.99 | 7863.82 | 274.940588 | 6 | 10 | 1 | 0 | 1 | 0 | -5.461538 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 | 38 | 40 | 8104.59 | 13273.99 | 231.559714 | 10 | 10 | 1 | 0 | 0 | 0 | 3.550000 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 | 9396.11 | 9997.53 | 303.100323 | 9 | 10 | 1 | 0 | 1 | 0 | -4.673913 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 | 8396.60 | 11807.39 | 279.886667 | 10 | 10 | 0 | 0 | 0 | 0 | -3.981481 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 | 4820.72 | 9924.32 | 229.558095 | 9 | 10 | 0 | 0 | 0 | 0 | 3.304348 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 | 6739.00 | 8066.76 | 293.000000 | 8 | 9 | 0 | 0 | 0 | 1 | 5.725000 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 | 5151.60 | 9230.63 | 271.136842 | 8 | 10 | 0 | 1 | 1 | 0 | 3.311111 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 | 5186.49 | 11015.50 | 235.749545 | 10 | 10 | 0 | 0 | 1 | 0 | 2.350877 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 | 6909.48 | 9792.30 | 265.749231 | 8 | 10 | 0 | 0 | 0 | 0 | 0.120000 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 | 7695.27 | 9229.10 | 256.509000 | 10 | 10 | 0 | 1 | 0 | 0 | 5.282609 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 | 8103.76 | 9434.37 | 270.125333 | 10 | 10 | 0 | 0 | 0 | 0 | -1.020000 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 | 5797.75 | 8694.34 | 241.572917 | 9 | 9 | 1 | 1 | 0 | 0 | 2.809524 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 | 5589.11 | 9260.48 | 243.004783 | 10 | 10 | 0 | 0 | 0 | 0 | 0.186047 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 | 5605.75 | 9366.50 | 243.728261 | 9 | 10 | 0 | 0 | 0 | 0 | 3.163265 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 | 5526.25 | 7725.81 | 251.193182 | 9 | 9 | 0 | 0 | 0 | 0 | -0.813953 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 | 6121.69 | 11976.82 | 235.449615 | 10 | 10 | 0 | 0 | 0 | 0 | -5.725806 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 | 4945.00 | 10629.51 | 224.772727 | 8 | 10 | 0 | 1 | 1 | 0 | -4.627451 |
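Notice the negative averages in Days_Between_Purchases (e.g. User11, User8): diff() ran in DataFrame row order, so a chronologically earlier purchase can follow a later one. Sorting each user's transactions by date before differencing keeps every gap non-negative; a sketch:
df_sorted = df.sort_values(['UserID', 'Transaction Date'])             # chronological order within each user
gaps = df_sorted.groupby('UserID')['Transaction Date'].diff().dt.days  # days since that user's previous purchase
avg_gap = gaps.groupby(df_sorted['UserID']).mean()                     # non-negative average gap per user
print(avg_gap.head())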
Seasonality & Trends:
Monthly and Yearly Spending Averages: These metrics represent the mean amount a customer spends per month and year, respectively. They offer a glimpse into each customer's general spending habits. A higher mean suggests a propensity for higher spending, while a lower mean may reflect more budget-conscious behavior.
Monthly and Yearly Spending Variability: These features indicate the fluctuation in a customer's monthly and yearly spending. A higher value indicates significant month-to-month or year-to-year spending variability, potentially reflecting sporadic large purchases. Conversely, a lower value suggests more stable, consistent spending patterns. Understanding this variability can inform the timing of personalized promotions or discounts.
Spending Trend: This metric illustrates the trajectory of a customer's spending over time. A positive value signifies increasing spending, possibly indicating growing loyalty or satisfaction. Conversely, a negative trend may signal declining interest or satisfaction, prompting the need for re-engagement strategies. A near-zero value indicates stable spending habits. Recognizing these trends enables the development of strategies to maintain or alter customer spending patterns, thereby optimizing marketing campaigns.
# Ensure 'Transaction Date' is in datetime format
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])
# Extract month and year from Transaction Date
df['Month'] = df['Transaction Date'].dt.month
df['Year'] = df['Transaction Date'].dt.year
# Calculate monthly spending mean and standard deviation for each user
monthly_spending_stats = df.groupby(['UserID', 'Month'])['Price'].sum().groupby('UserID').agg(['mean', 'std']).reset_index()
monthly_spending_stats.columns = ['UserID', 'Monthly_Spending_Mean', 'Monthly_Spending_Std']
# Calculate yearly spending mean and standard deviation for each user
yearly_spending_stats = df.groupby(['UserID', 'Year'])['Price'].sum().groupby('UserID').agg(['mean', 'std']).reset_index()
yearly_spending_stats.columns = ['UserID', 'Yearly_Spending_Mean', 'Yearly_Spending_Std']
# Merge the results with customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, monthly_spending_stats, on='UserID', how='left')
customer_engagement = pd.merge(customer_engagement, yearly_spending_stats, on='UserID', how='left')
# Calculate spending trend for each user
def calculate_spending_trend(x):
slope, _, _, _, _ = linregress(np.arange(len(x)), x)
return slope
spending_trend = df.groupby('UserID')['Price'].apply(calculate_spending_trend).reset_index()
spending_trend.columns = ['UserID', 'Spending_Trend']
# Merge spending trend with customer_engagement dataframe
customer_engagement = pd.merge(customer_engagement, spending_trend, on='UserID')
# Plotting the spending trend
plt.figure(figsize=(15, 6))
plt.scatter(customer_engagement['UserID'], customer_engagement['Spending_Trend'], color='blue')
plt.title('Spending Trend of Customers')
plt.xlabel('User ID')
plt.ylabel('Spending Trend')
plt.grid(True)
plt.show()
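A similar ordering caveat applies here: linregress fits each user's prices in DataFrame row order, not purchase order, so the slope is not strictly a trend over time. A chronological variant (a sketch) sorts by Transaction Date before fitting:
trend_over_time = (
    df.sort_values('Transaction Date')                          # ensure purchase order
      .groupby('UserID')['Price']
      .apply(lambda s: linregress(np.arange(len(s)), s).slope)  # slope of price vs. purchase index
)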
customer_engagement
| | UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend | Willing to Spend | ... | Same_Age_or_Above | Is_Male | Buying_Same_Item | From_Same_Location | Days_Between_Purchases | Monthly_Spending_Mean | Monthly_Spending_Std | Yearly_Spending_Mean | Yearly_Spending_Std | Spending_Trend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 | 5776.59 | 11456.50 | ... | 0 | 0 | 0 | 0 | 4.448980 | 1204.861667 | 585.342892 | 7229.170 | 937.156901 | -0.684772 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 | 7441.41 | 10486.52 | ... | 1 | 1 | 1 | 0 | 8.655172 | 1260.959167 | 585.678930 | 7565.755 | 4165.969099 | 0.617309 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 | 5301.64 | 10131.83 | ... | 1 | 1 | 0 | 0 | -0.980000 | 1006.054167 | 570.867239 | 6036.325 | 1416.426807 | 3.460524 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 | 4673.99 | 7863.82 | ... | 1 | 0 | 1 | 0 | -5.461538 | 901.659167 | 482.868280 | 5409.955 | 1166.789829 | -4.053605 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 | 38 | 40 | 8104.59 | 13273.99 | ... | 1 | 0 | 0 | 0 | 3.550000 | 1230.004167 | 755.513942 | 7380.025 | 2622.001442 | -1.255994 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 | 9396.11 | 9997.53 | ... | 1 | 0 | 1 | 0 | -4.673913 | 1107.518333 | 623.848447 | 6645.110 | 2353.010951 | -0.295626 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 | 8396.60 | 11807.39 | ... | 0 | 0 | 0 | 0 | -3.981481 | 1284.206667 | 933.262190 | 7705.240 | 2028.547934 | -0.315243 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 | 4820.72 | 9924.32 | ... | 0 | 0 | 0 | 0 | 3.304348 | 953.534167 | 492.675970 | 5721.205 | 264.224591 | 0.208918 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 | 6739.00 | 8066.76 | ... | 0 | 0 | 0 | 1 | 5.725000 | 893.305000 | 423.069753 | 5359.830 | 386.193440 | -0.701129 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 | 5151.60 | 9230.63 | ... | 0 | 1 | 1 | 0 | 3.311111 | 920.897500 | 347.307700 | 5525.385 | 773.398042 | 2.562159 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 | 5186.49 | 11015.50 | ... | 0 | 0 | 1 | 0 | 2.350877 | 1216.821667 | 714.442483 | 7300.930 | 938.005430 | -0.528994 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 | 6909.48 | 9792.30 | ... | 0 | 0 | 0 | 0 | 0.120000 | 1126.255000 | 749.061419 | 6757.530 | 759.347830 | -0.094429 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 | 7695.27 | 9229.10 | ... | 0 | 1 | 0 | 0 | 5.282609 | 952.026667 | 623.558073 | 5712.160 | 599.216429 | 1.430459 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 | 8103.76 | 9434.37 | ... | 0 | 0 | 0 | 0 | -1.020000 | 1095.616667 | 681.874269 | 6573.700 | 1107.046377 | -0.936123 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 | 5797.75 | 8694.34 | ... | 1 | 1 | 0 | 0 | 2.809524 | 890.831667 | 440.318948 | 5344.990 | 785.270365 | 1.864607 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 | 5589.11 | 9260.48 | ... | 0 | 0 | 0 | 0 | 0.186047 | 942.547500 | 510.093180 | 5655.285 | 912.033397 | 0.975988 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 | 5605.75 | 9366.50 | ... | 0 | 0 | 0 | 0 | 3.163265 | 1005.869167 | 670.056435 | 6035.215 | 80.560676 | -1.027859 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 | 5526.25 | 7725.81 | ... | 0 | 0 | 0 | 0 | -0.813953 | 812.149167 | 412.437383 | 4872.895 | 704.214714 | 0.975034 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 | 6121.69 | 11976.82 | ... | 0 | 0 | 0 | 0 | -5.725806 | 1299.064167 | 533.605655 | 7794.385 | 1187.790900 | -0.172598 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 | 4945.00 | 10629.51 | ... | 0 | 1 | 1 | 0 | -4.627451 | 1103.862500 | 632.628950 | 6623.175 | 1827.185136 | -0.027006 |
20 rows × 23 columns
customer_engagement.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 20 entries, 0 to 19 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 UserID 20 non-null object 1 Days_Since_Last_Purchase 20 non-null int64 2 Total Transactions 20 non-null int64 3 Total Products Purchased 20 non-null int32 4 Total Products Searched 20 non-null int32 5 Total Products in Wishlist 20 non-null int32 6 Total Products in Cart 20 non-null int32 7 Total Products in Click History 20 non-null int32 8 Total Spend 20 non-null float64 9 Willing to Spend 20 non-null float64 10 Average Transaction Value 20 non-null float64 11 Unique Products Purchased 20 non-null int64 12 Unique Products to be Purchased 20 non-null int64 13 Same_Age_or_Above 20 non-null int64 14 Is_Male 20 non-null int64 15 Buying_Same_Item 20 non-null int64 16 From_Same_Location 20 non-null int64 17 Days_Between_Purchases 20 non-null float64 18 Monthly_Spending_Mean 20 non-null float64 19 Monthly_Spending_Std 20 non-null float64 20 Yearly_Spending_Mean 20 non-null float64 21 Yearly_Spending_Std 20 non-null float64 22 Spending_Trend 20 non-null float64 dtypes: float64(9), int32(5), int64(8), object(1) memory usage: 3.4+ KB
customer_engagement.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Days_Since_Last_Purchase | 20.0 | 10.100000 | 9.425051 | 0.000000 | 2.000000 | 8.000000 | 14.250000 | 31.000000 |
| Total Transactions | 20.0 | 24.950000 | 4.650693 | 17.000000 | 22.000000 | 23.500000 | 29.250000 | 35.000000 |
| Total Products Purchased | 20.0 | 24.950000 | 4.650693 | 17.000000 | 22.000000 | 23.500000 | 29.250000 | 35.000000 |
| Total Products Searched | 20.0 | 25.750000 | 5.369259 | 18.000000 | 21.750000 | 25.000000 | 30.000000 | 36.000000 |
| Total Products in Wishlist | 20.0 | 25.300000 | 4.508472 | 15.000000 | 22.750000 | 26.000000 | 28.000000 | 35.000000 |
| Total Products in Cart | 20.0 | 26.100000 | 5.179921 | 17.000000 | 23.000000 | 25.000000 | 30.000000 | 38.000000 |
| Total Products in Click History | 20.0 | 25.700000 | 5.232490 | 17.000000 | 22.000000 | 25.000000 | 27.000000 | 40.000000 |
| Total Spend | 20.0 | 6364.140000 | 1387.745692 | 4673.990000 | 5272.852500 | 5787.170000 | 7504.875000 | 9396.110000 |
| Willing to Spend | 20.0 | 9968.201000 | 1439.240817 | 7725.810000 | 9230.247500 | 9858.310000 | 10726.007500 | 13273.990000 |
| Average Transaction Value | 20.0 | 255.018795 | 25.616406 | 203.909231 | 235.674563 | 253.851091 | 272.087779 | 303.100323 |
| Unique Products Purchased | 20.0 | 9.100000 | 1.071153 | 6.000000 | 8.750000 | 9.000000 | 10.000000 | 10.000000 |
| Unique Products to be Purchased | 20.0 | 9.850000 | 0.366348 | 9.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 |
| Same_Age_or_Above | 20.0 | 0.300000 | 0.470162 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| Is_Male | 20.0 | 0.300000 | 0.470162 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| Buying_Same_Item | 20.0 | 0.300000 | 0.470162 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| From_Same_Location | 20.0 | 0.050000 | 0.223607 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| Days_Between_Purchases | 20.0 | 0.781139 | 4.139228 | -5.725806 | -1.760370 | 1.268462 | 3.370833 | 8.655172 |
| Monthly_Spending_Mean | 20.0 | 1060.402208 | 151.658603 | 812.149167 | 937.135000 | 1050.835417 | 1207.851667 | 1299.064167 |
| Monthly_Spending_Std | 20.0 | 588.425607 | 140.772095 | 347.307700 | 490.224047 | 585.510911 | 673.010893 | 933.262190 |
| Yearly_Spending_Mean | 20.0 | 6362.413250 | 909.951619 | 4872.895000 | 5622.810000 | 6305.012500 | 7247.110000 | 7794.385000 |
| Yearly_Spending_Std | 20.0 | 1250.719514 | 956.976961 | 80.560676 | 745.564551 | 937.581165 | 1519.116389 | 4165.969099 |
| Spending_Trend | 20.0 | 0.100081 | 1.582787 | -4.053605 | -0.688861 | -0.133513 | 0.975272 | 3.460524 |
df
| UserID | Age | Gender | Location | Item Name | Brand | Price | Rating | Number of Reviews | Purchase History | Search History | Wishlist | Cart Items | Click History | Transaction No. | Transaction Date | Transaction Day | Month | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User14 | 40 | Female | Russia | Watch | Henry-Johnson | 205.90 | 2.8 | 624 | 1 | 0 | 0 | 0 | 0 | 1 | 2023-07-08 | 2023-07-08 | 7 | 2023 |
| 1 | User8 | 50 | Male | Russia | T-Shirt | Lawson-Tucker | 132.02 | 3.1 | 794 | 0 | 0 | 0 | 1 | 1 | 0 | 2023-03-14 | 2023-03-14 | 3 | 2023 |
| 2 | User2 | 59 | Male | India | Headphones | Gomez LLC | 384.34 | 1.4 | 619 | 0 | 0 | 0 | 0 | 0 | 0 | 2022-10-17 | 2022-10-17 | 10 | 2022 |
| 3 | User6 | 48 | Female | Argentina | T-Shirt | Morris-Miller | 341.77 | 3.0 | 336 | 1 | 1 | 1 | 0 | 1 | 4 | 2022-04-02 | 2022-04-02 | 4 | 2022 |
| 4 | User5 | 49 | Female | Singapore | Bag | Waters-Hall | 309.58 | 3.6 | 973 | 1 | 1 | 1 | 0 | 0 | 5 | 2022-01-29 | 2022-01-29 | 1 | 2022 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | User3 | 25 | Male | Belgium | Watch | Jimenez, Noble and Nicholson | 372.24 | 4.7 | 320 | 0 | 1 | 1 | 0 | 0 | 0 | 2022-11-02 | 2022-11-02 | 11 | 2022 |
| 996 | User19 | 42 | Male | UK | Shoes | Jenkins-Williams | 27.06 | 4.1 | 239 | 0 | 1 | 0 | 0 | 0 | 0 | 2023-09-08 | 2023-09-08 | 9 | 2023 |
| 997 | User7 | 42 | Female | Singapore | T-Shirt | Leonard, Phillips and Scott | 148.12 | 3.5 | 449 | 0 | 0 | 0 | 1 | 0 | 0 | 2023-01-11 | 2023-01-11 | 1 | 2023 |
| 998 | User11 | 56 | Male | Japan | Perfume | Coleman-Savage | 481.67 | 3.6 | 419 | 0 | 1 | 1 | 1 | 1 | 0 | 2022-07-02 | 2022-07-02 | 7 | 2022 |
| 999 | User1 | 52 | Female | India | Bag | Wright-Richards | 427.96 | 1.9 | 25 | 1 | 0 | 0 | 0 | 1 | 1000 | 2023-01-20 | 2023-01-20 | 1 | 2023 |
1000 rows × 19 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 UserID 1000 non-null object 1 Age 1000 non-null int64 2 Gender 1000 non-null object 3 Location 1000 non-null object 4 Item Name 1000 non-null object 5 Brand 1000 non-null object 6 Price 1000 non-null float64 7 Rating 1000 non-null float64 8 Number of Reviews 1000 non-null int64 9 Purchase History 1000 non-null int32 10 Search History 1000 non-null int32 11 Wishlist 1000 non-null int32 12 Cart Items 1000 non-null int32 13 Click History 1000 non-null int32 14 Transaction No. 1000 non-null int64 15 Transaction Date 1000 non-null datetime64[ns] 16 Transaction Day 1000 non-null object 17 Month 1000 non-null int64 18 Year 1000 non-null int64 dtypes: datetime64[ns](1), float64(2), int32(5), int64(5), object(6) memory usage: 129.0+ KB
The feature engineering process is now complete. A variety of new customer-focused features has been created, giving a deeper understanding of buying patterns and preferences.
The next step is to detect and handle outliers, preparing the data for clustering.
Outliers are data points that deviate significantly from the majority of other observations in the dataset. They have the potential to distort the outcomes of our analysis, particularly in techniques like k-means clustering, where they can heavily influence the positioning of cluster centroids. Therefore, it's crucial to appropriately identify and handle these outliers to attain more precise and meaningful clustering results.
Visually Detecting Outliers through Box Plots
Interpreting Outliers:
Lower Boundary: Q1 - 1.5 × IQR
Upper Boundary: Q3 + 1.5 × IQR
where Q1 is the first quartile,
Q3 is the third quartile,
and the interquartile range (IQR) is the range between Q1 and Q3 (the middle 50% of the data, which contains the median).
minimum: 17.000000
Q1(25%): 22.000000
Median(50%): 25.000000
Q3(75%): 27.000000
maximum: 40.000000
Calculating boundaries:
IQR = Q3 - Q1 = 27.0 - 22.0 = 5.0
Lower Boundary: Q1 - 1.5 × IQR = 22.0 - 1.5 × 5.0 = 14.5
Upper Boundary: Q3 + 1.5 × IQR = 27.0 + 1.5 × 5.0 = 34.5
With these boundaries, an observation with value 17.0 is labelled an inlier, while 40.0 lies above the upper boundary and is labelled an outlier.
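The same boundaries can be reproduced directly in pandas. A minimal sketch, assuming the IQR rule is applied to the 'Total Products in Click History' feature, whose quartiles match the worked example above:
# Computing IQR boundaries for one engineered feature
feature = customer_engagement['Total Products in Click History']
q1, q3 = feature.quantile(0.25), feature.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(f'Lower boundary: {lower}, Upper boundary: {upper}')
print('IQR outliers:', feature[(feature < lower) | (feature > upper)].tolist())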
# Plotting box plots for each feature
plt.figure(figsize=(15, 15))
for i, column in enumerate(customer_engagement.columns[1:], 1):
plt.subplot(5, 5, i)
customer_engagement.boxplot(column=column)
plt.title(column)
plt.tight_layout()
plt.show()
The Isolation Forest algorithm is a powerful tool for detecting and managing outliers, especially in scenarios like customer segmentation using K-means clustering.
Unsupervised Approach: In customer segmentation, where we often lack labeled data, this unsupervised approach is advantageous.
Fast and Scalable: Isolation Forests are built using binary trees. Each tree (called an Isolation Tree) processes a random subset of the data.
“Few and Different” Principle: The underlying idea is that anomalies are both “few” (small in number) and “different” (distinct from the majority). Isolation Forest leverages this principle: data points that require more splits to isolate (i.e., travel deeper into the tree) are less likely to be anomalies, while points ending up in shorter branches are more likely to be anomalies.
No Assumptions About Data Distribution: It works well even when the data is irregularly distributed or contains complex patterns. In customer segmentation, where customer behavior can vary significantly, this flexibility is crucial.
Direct Anomaly Detection: The Isolation Forest algorithm directly quantifies how far a data point is from the rest of the data. It doesn’t rely on distance metrics or similarity scores.
# Initializing the IsolationForest model with a contamination parameter of 0.05
model = IsolationForest(contamination=0.05, random_state=0)
# Fitting the model on our dataset (converting DataFrame to NumPy to avoid warning)
customer_engagement['Outlier_Scores'] = model.fit_predict(customer_engagement.iloc[:, 1:].to_numpy())
# Creating a new flag column (fit_predict returns -1 for outliers and 1 for inliers; here 1 marks an outlier, 0 an inlier)
customer_engagement['Is_Outlier'] = [1 if x == -1 else 0 for x in customer_engagement['Outlier_Scores']]
customer_engagement
| UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend | Willing to Spend | ... | Buying_Same_Item | From_Same_Location | Days_Between_Purchases | Monthly_Spending_Mean | Monthly_Spending_Std | Yearly_Spending_Mean | Yearly_Spending_Std | Spending_Trend | Outlier_Scores | Is_Outlier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 | 5776.59 | 11456.50 | ... | 0 | 0 | 4.448980 | 1204.861667 | 585.342892 | 7229.170 | 937.156901 | -0.684772 | 1 | 0 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 | 7441.41 | 10486.52 | ... | 1 | 0 | 8.655172 | 1260.959167 | 585.678930 | 7565.755 | 4165.969099 | 0.617309 | 1 | 0 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 | 5301.64 | 10131.83 | ... | 0 | 0 | -0.980000 | 1006.054167 | 570.867239 | 6036.325 | 1416.426807 | 3.460524 | 1 | 0 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 | 4673.99 | 7863.82 | ... | 1 | 0 | -5.461538 | 901.659167 | 482.868280 | 5409.955 | 1166.789829 | -4.053605 | 1 | 0 |
| 4 | User13 | 27 | 35 | 35 | 33 | 35 | 38 | 40 | 8104.59 | 13273.99 | ... | 0 | 0 | 3.550000 | 1230.004167 | 755.513942 | 7380.025 | 2622.001442 | -1.255994 | -1 | 1 |
| 5 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 | 9396.11 | 9997.53 | ... | 1 | 0 | -4.673913 | 1107.518333 | 623.848447 | 6645.110 | 2353.010951 | -0.295626 | 1 | 0 |
| 6 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 | 8396.60 | 11807.39 | ... | 0 | 0 | -3.981481 | 1284.206667 | 933.262190 | 7705.240 | 2028.547934 | -0.315243 | 1 | 0 |
| 7 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 | 4820.72 | 9924.32 | ... | 0 | 0 | 3.304348 | 953.534167 | 492.675970 | 5721.205 | 264.224591 | 0.208918 | 1 | 0 |
| 8 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 | 6739.00 | 8066.76 | ... | 0 | 1 | 5.725000 | 893.305000 | 423.069753 | 5359.830 | 386.193440 | -0.701129 | 1 | 0 |
| 9 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 | 5151.60 | 9230.63 | ... | 1 | 0 | 3.311111 | 920.897500 | 347.307700 | 5525.385 | 773.398042 | 2.562159 | 1 | 0 |
| 10 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 | 5186.49 | 11015.50 | ... | 1 | 0 | 2.350877 | 1216.821667 | 714.442483 | 7300.930 | 938.005430 | -0.528994 | 1 | 0 |
| 11 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 | 6909.48 | 9792.30 | ... | 0 | 0 | 0.120000 | 1126.255000 | 749.061419 | 6757.530 | 759.347830 | -0.094429 | 1 | 0 |
| 12 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 | 7695.27 | 9229.10 | ... | 0 | 0 | 5.282609 | 952.026667 | 623.558073 | 5712.160 | 599.216429 | 1.430459 | 1 | 0 |
| 13 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 | 8103.76 | 9434.37 | ... | 0 | 0 | -1.020000 | 1095.616667 | 681.874269 | 6573.700 | 1107.046377 | -0.936123 | 1 | 0 |
| 14 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 | 5797.75 | 8694.34 | ... | 0 | 0 | 2.809524 | 890.831667 | 440.318948 | 5344.990 | 785.270365 | 1.864607 | 1 | 0 |
| 15 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 | 5589.11 | 9260.48 | ... | 0 | 0 | 0.186047 | 942.547500 | 510.093180 | 5655.285 | 912.033397 | 0.975988 | 1 | 0 |
| 16 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 | 5605.75 | 9366.50 | ... | 0 | 0 | 3.163265 | 1005.869167 | 670.056435 | 6035.215 | 80.560676 | -1.027859 | 1 | 0 |
| 17 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 | 5526.25 | 7725.81 | ... | 0 | 0 | -0.813953 | 812.149167 | 412.437383 | 4872.895 | 704.214714 | 0.975034 | 1 | 0 |
| 18 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 | 6121.69 | 11976.82 | ... | 0 | 0 | -5.725806 | 1299.064167 | 533.605655 | 7794.385 | 1187.790900 | -0.172598 | 1 | 0 |
| 19 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 | 4945.00 | 10629.51 | ... | 1 | 0 | -4.627451 | 1103.862500 | 632.628950 | 6623.175 | 1827.185136 | -0.027006 | 1 | 0 |
20 rows × 25 columns
# Calculate the percentage of inliers and outliers
outlier_percentage = customer_engagement['Is_Outlier'].value_counts(normalize=True) * 100
outlier_percentage
0 95.0 1 5.0 Name: Is_Outlier, dtype: float64
# Plotting the percentage of inliers and outliers
plt.figure(figsize=(12, 4))
outlier_percentage.plot(kind='barh', color=['#4287f5', '#ff0000'])
# Adding the percentage labels on the bars
for index, value in enumerate(outlier_percentage):
plt.text(value, index, f'{value:.2f}%', fontsize=15)
plt.title('Percentage of Inliers and Outliers')
plt.xticks(ticks=np.arange(0, 115, 5))
plt.xlabel('Percentage (%)')
plt.ylabel('Is Outlier')
plt.gca().invert_yaxis()
plt.show()
# Separate the outliers for analysis
outliers_data = customer_engagement[customer_engagement['Is_Outlier'] == 1]
# Remove the outliers from the main dataset
customer_engagement_cleaned = customer_engagement[customer_engagement['Is_Outlier'] == 0]
# Drop the 'Outlier_Scores' and 'Is_Outlier' columns
customer_engagement_cleaned = customer_engagement_cleaned.drop(columns=['Outlier_Scores', 'Is_Outlier'])
# Reset the index of the cleaned data
customer_engagement_cleaned.reset_index(drop=True, inplace=True)
# Getting the number of rows in the cleaned customer dataset
customer_engagement_cleaned.shape[0]
19
customer_engagement_cleaned
| UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend | Willing to Spend | ... | Same_Age_or_Above | Is_Male | Buying_Same_Item | From_Same_Location | Days_Between_Purchases | Monthly_Spending_Mean | Monthly_Spending_Std | Yearly_Spending_Mean | Yearly_Spending_Std | Spending_Trend | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 | 5776.59 | 11456.50 | ... | 0 | 0 | 0 | 0 | 4.448980 | 1204.861667 | 585.342892 | 7229.170 | 937.156901 | -0.684772 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 | 7441.41 | 10486.52 | ... | 1 | 1 | 1 | 0 | 8.655172 | 1260.959167 | 585.678930 | 7565.755 | 4165.969099 | 0.617309 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 | 5301.64 | 10131.83 | ... | 1 | 1 | 0 | 0 | -0.980000 | 1006.054167 | 570.867239 | 6036.325 | 1416.426807 | 3.460524 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 | 4673.99 | 7863.82 | ... | 1 | 0 | 1 | 0 | -5.461538 | 901.659167 | 482.868280 | 5409.955 | 1166.789829 | -4.053605 |
| 4 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 | 9396.11 | 9997.53 | ... | 1 | 0 | 1 | 0 | -4.673913 | 1107.518333 | 623.848447 | 6645.110 | 2353.010951 | -0.295626 |
| 5 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 | 8396.60 | 11807.39 | ... | 0 | 0 | 0 | 0 | -3.981481 | 1284.206667 | 933.262190 | 7705.240 | 2028.547934 | -0.315243 |
| 6 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 | 4820.72 | 9924.32 | ... | 0 | 0 | 0 | 0 | 3.304348 | 953.534167 | 492.675970 | 5721.205 | 264.224591 | 0.208918 |
| 7 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 | 6739.00 | 8066.76 | ... | 0 | 0 | 0 | 1 | 5.725000 | 893.305000 | 423.069753 | 5359.830 | 386.193440 | -0.701129 |
| 8 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 | 5151.60 | 9230.63 | ... | 0 | 1 | 1 | 0 | 3.311111 | 920.897500 | 347.307700 | 5525.385 | 773.398042 | 2.562159 |
| 9 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 | 5186.49 | 11015.50 | ... | 0 | 0 | 1 | 0 | 2.350877 | 1216.821667 | 714.442483 | 7300.930 | 938.005430 | -0.528994 |
| 10 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 | 6909.48 | 9792.30 | ... | 0 | 0 | 0 | 0 | 0.120000 | 1126.255000 | 749.061419 | 6757.530 | 759.347830 | -0.094429 |
| 11 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 | 7695.27 | 9229.10 | ... | 0 | 1 | 0 | 0 | 5.282609 | 952.026667 | 623.558073 | 5712.160 | 599.216429 | 1.430459 |
| 12 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 | 8103.76 | 9434.37 | ... | 0 | 0 | 0 | 0 | -1.020000 | 1095.616667 | 681.874269 | 6573.700 | 1107.046377 | -0.936123 |
| 13 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 | 5797.75 | 8694.34 | ... | 1 | 1 | 0 | 0 | 2.809524 | 890.831667 | 440.318948 | 5344.990 | 785.270365 | 1.864607 |
| 14 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 | 5589.11 | 9260.48 | ... | 0 | 0 | 0 | 0 | 0.186047 | 942.547500 | 510.093180 | 5655.285 | 912.033397 | 0.975988 |
| 15 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 | 5605.75 | 9366.50 | ... | 0 | 0 | 0 | 0 | 3.163265 | 1005.869167 | 670.056435 | 6035.215 | 80.560676 | -1.027859 |
| 16 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 | 5526.25 | 7725.81 | ... | 0 | 0 | 0 | 0 | -0.813953 | 812.149167 | 412.437383 | 4872.895 | 704.214714 | 0.975034 |
| 17 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 | 6121.69 | 11976.82 | ... | 0 | 0 | 0 | 0 | -5.725806 | 1299.064167 | 533.605655 | 7794.385 | 1187.790900 | -0.172598 |
| 18 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 | 4945.00 | 10629.51 | ... | 0 | 1 | 1 | 0 | -4.627451 | 1103.862500 | 632.628950 | 6623.175 | 1827.185136 | -0.027006 |
19 rows × 23 columns
Correlation analysis helps identify which features are strongly related to each other. Multicollinearity occurs when two or more features are highly correlated. In K-means, multicollinearity can lead to biased cluster assignments. Reducing dimensionality simplifies the clustering process. Techniques like Principal Component Analysis (PCA) can be applied based on correlations.
# Reset background style
sns.set_style('whitegrid')
# Calculate the correlation matrix excluding the 'UserID' column
corr = customer_engagement_cleaned.drop(columns=['UserID']).corr()
# Define a custom colormap
colors = ['#ff0000', '#ffcccc', 'white', '#ffcccc', '#ff0000']
my_cmap = LinearSegmentedColormap.from_list('custom_map', colors, N=256)
# Create a mask to only show the lower triangle of the matrix (since it's mirrored around its top-left to bottom-right diagonal)
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask, k=1)] = True
# Plot the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr, mask=mask, cmap=my_cmap, annot=True, center=0, fmt='.2f', linewidths=2)
plt.title('Correlation Matrix', fontsize=14)
plt.show()
Color Coding:
In this heatmap, darker red indicates a stronger correlation, whether positive or negative (the custom colormap is symmetric around zero).
Cells close to white indicate correlations near zero.
The sign is given by the annotated coefficient: positive means the two variables tend to increase together; negative means one tends to decrease as the other increases.
Pairs of variables that have high correlations:
Total Products Purchased & Total Transactions
Total Products in Cart & Total Products Searched
Total Spend & Total Transactions
Total Spend & Total Products Purchased
Willing to Spend & Total Products in Cart
Monthly_Spending_Mean & Total Products Searched
Monthly_Spending_Mean & Total Products in Cart
Monthly_Spending_Mean & Willing to Spend
Yearly_Spending_Mean & Total Products Searched
Yearly_Spending_Mean & Total Products in Cart
Yearly_Spending_Mean & Willing to Spend
Yearly_Spending_Mean & Monthly_Spending_Mean
To obtain uncorrelated variables, we now apply the PCA technique.
K-Means Clustering:
K-means relies on distance calculations between data points. Scaling ensures that all features contribute equally to the algorithm.
Proper scaling prevents biased clusters based on feature magnitudes.
Scaling accelerates convergence by ensuring similar ranges for all features. Faster convergence leads to quicker model training.
Scaling makes the algorithm more stable across different datasets.
PCA (Principal Component Analysis):
PCA aims to find orthogonal axes (principal components) that capture maximum variance. Scaling ensures that features contribute equally to the principal components.
Scaled features allow meaningful interpretation of principal components.
Scaling ensures that eigenvalues reflect true variance proportions.
Scaling maintains a stable covariance matrix for accurate eigen decomposition.
# Initialize the StandardScaler
scaler = StandardScaler()
# List of columns that don't need to be scaled
columns_to_exclude = ['UserID', 'Unique Products Purchased', 'Unique Products to be Purchased', 'Same_Age_or_Above', 'Is_Male', 'Buying_Same_Item', 'From_Same_Location']
# List of columns that need to be scaled
columns_to_scale = customer_engagement_cleaned.columns.difference(columns_to_exclude)
# Copy the cleaned dataset
customer_engagement_scaled = customer_engagement_cleaned.copy()
# Applying the scaler to the necessary columns in the dataset
customer_engagement_scaled[columns_to_scale] = scaler.fit_transform(customer_engagement_scaled[columns_to_scale])
customer_engagement_scaled
| UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend | Willing to Spend | ... | Same_Age_or_Above | Is_Male | Buying_Same_Item | From_Same_Location | Days_Between_Purchases | Monthly_Spending_Mean | Monthly_Spending_Std | Yearly_Spending_Mean | Yearly_Spending_Std | Spending_Trend | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | -0.141673 | -1.104117 | -1.104117 | -0.072369 | 0.568607 | -0.108710 | 0.512360 | -0.374045 | 1.372914 | ... | 0 | 0 | 0 | 0 | 0.932961 | 1.048360 | 0.042255 | 1.048360 | -0.267942 | -0.552308 |
| 1 | User10 | 1.028672 | 1.143550 | 1.143550 | 0.320492 | 0.311380 | 1.497786 | 2.259640 | 0.881568 | 0.571790 | ... | 1 | 1 | 1 | 0 | 1.961974 | 1.431774 | 0.044741 | 1.431774 | 3.316030 | 0.287599 |
| 2 | User11 | -0.141673 | 0.394327 | 0.394327 | -0.072369 | 0.568607 | 1.038787 | 0.013137 | -0.732254 | 0.278845 | ... | 1 | 1 | 0 | 0 | -0.395198 | -0.310447 | -0.064840 | -0.310447 | 0.264046 | 2.121612 |
| 3 | User12 | -0.843880 | -1.853339 | -1.853339 | -1.250951 | -0.717528 | -1.944705 | 0.262749 | -1.205630 | -1.594347 | ... | 1 | 0 | 1 | 0 | -1.491572 | -1.023965 | -0.715880 | -1.023965 | -0.013050 | -2.725371 |
| 4 | User14 | 0.443499 | 1.643031 | 1.643031 | -0.858090 | -0.717528 | -0.338210 | 0.013137 | 2.355809 | 0.167924 | ... | 1 | 0 | 1 | 0 | -1.298885 | 0.383039 | 0.327129 | 0.383039 | 1.303652 | -0.301289 |
| 5 | User15 | -0.843880 | 1.393290 | 1.393290 | 0.516922 | 0.054153 | 0.809287 | 0.512360 | 1.601975 | 1.662721 | ... | 0 | 0 | 0 | 0 | -1.129487 | 1.590666 | 2.616255 | 1.590666 | 0.943499 | -0.313943 |
| 6 | User16 | -0.492777 | -0.854376 | -0.854376 | -0.465230 | 0.825834 | -0.108710 | -0.735697 | -1.094965 | 0.107459 | ... | 0 | 0 | 0 | 0 | 0.652935 | -0.669410 | -0.643320 | -0.669410 | -1.014895 | 0.024166 |
| 7 | User17 | -0.726846 | -0.354895 | -0.354895 | 0.124061 | -2.518116 | -0.567709 | -1.983753 | 0.351808 | -1.426735 | ... | 0 | 0 | 0 | 1 | 1.245129 | -1.081064 | -1.158286 | -1.081064 | -0.879510 | -0.562859 |
| 8 | User18 | -0.024639 | -1.353858 | -1.353858 | -0.268799 | -0.460301 | 0.350289 | 0.512360 | -0.845415 | -0.465473 | ... | 0 | 1 | 1 | 0 | 0.654590 | -0.892475 | -1.718794 | -0.892475 | -0.449714 | 1.542123 |
| 9 | User19 | -1.077949 | -0.604635 | -0.604635 | 1.891934 | 1.854741 | 1.268286 | 1.261194 | -0.819101 | 1.008684 | ... | 0 | 0 | 1 | 0 | 0.419676 | 1.130104 | 0.997368 | 1.130104 | -0.267000 | -0.451823 |
| 10 | User2 | 2.550120 | 0.394327 | 0.394327 | 1.106213 | 0.311380 | -0.797208 | -0.985308 | 0.480384 | -0.001579 | ... | 0 | 0 | 0 | 0 | -0.126091 | 0.511100 | 1.253488 | 0.511100 | -0.465310 | -0.171507 |
| 11 | User20 | -0.024639 | 1.393290 | 1.393290 | -0.661660 | 1.083061 | -0.567709 | 0.013137 | 1.073030 | -0.466737 | ... | 0 | 1 | 0 | 0 | 1.136902 | -0.679714 | 0.324981 | -0.679714 | -0.643055 | 0.812121 |
| 12 | User3 | -0.960915 | 1.393290 | 1.393290 | 0.909782 | 0.568607 | -0.338210 | 0.262749 | 1.381114 | -0.297200 | ... | 0 | 0 | 0 | 0 | -0.404983 | 0.301694 | 0.756420 | 0.301694 | -0.079365 | -0.714441 |
| 13 | User4 | 1.145706 | -0.105154 | -0.105154 | -1.054521 | -0.974755 | -1.256207 | -0.486085 | -0.358086 | -0.908405 | ... | 1 | 1 | 0 | 0 | 0.531880 | -1.097968 | -1.030672 | -1.097968 | -0.436536 | 1.092168 |
| 14 | User5 | 1.964947 | -0.354895 | -0.354895 | -1.054521 | -1.489208 | -0.338210 | -0.735697 | -0.515443 | -0.440819 | ... | 0 | 0 | 0 | 0 | -0.109933 | -0.744502 | -0.514463 | -0.744502 | -0.295829 | 0.518964 |
| 15 | User6 | -0.960915 | -0.354895 | -0.354895 | -0.661660 | -0.460301 | 0.120789 | -0.735697 | -0.502893 | -0.353255 | ... | 0 | 0 | 0 | 0 | 0.618421 | -0.311712 | 0.668988 | -0.311712 | -1.218761 | -0.773616 |
| 16 | User7 | -0.843880 | -0.604635 | -0.604635 | -1.447381 | -0.460301 | -1.485707 | -0.985308 | -0.562852 | -1.708332 | ... | 0 | 0 | 0 | 0 | -0.354576 | -1.635747 | -1.236947 | -1.635747 | -0.526507 | 0.518349 |
| 17 | User8 | -0.258708 | 0.394327 | 0.394327 | 2.088364 | 0.825834 | 1.727285 | 1.760417 | -0.113770 | 1.802656 | ... | 0 | 0 | 0 | 0 | -1.556223 | 1.692214 | -0.340511 | 1.692214 | 0.010261 | -0.221930 |
| 18 | User9 | 0.209430 | -0.604635 | -0.604635 | 0.909782 | 0.825834 | 1.038787 | -0.735697 | -1.001233 | 0.689888 | ... | 0 | 1 | 1 | 0 | -1.287519 | 0.358052 | 0.392090 | 0.358052 | 0.719987 | -0.128016 |
19 rows × 23 columns
PCA is applied to address the multicollinearity identified above.
PCA compresses the original features into a smaller set of principal components. These components capture most of the data’s variance. By retaining only a few components, we reduce storage requirements and computational complexity.
PCA focuses on the most informative dimensions, filtering out noise. Cleaner data leads to better clustering results.
Clusters become more distinct, as irrelevant features are removed. The reduced feature space enhances the algorithm’s efficiency.
PCA transforms data into a lower-dimensional space (e.g., 2D or 3D), allowing visualization.
# Setting UserID as the index column
customer_engagement_scaled.set_index('UserID', inplace=True)
# Apply PCA
pca = PCA().fit(customer_engagement_scaled)
# Calculate the Cumulative Sum of the Explained Variance
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
# Set the optimal number of components (labelled k in the plot)
optimal_k = 6
# Set seaborn plot style
sns.set(rc={'axes.facecolor': '#fcf0dc'}, style='darkgrid')
# Plot the cumulative explained variance against the number of components
plt.figure(figsize=(20, 10))
# Bar chart for the explained variance of each component
barplot = sns.barplot(x=list(range(1, len(cumulative_explained_variance) + 1)), y=explained_variance_ratio, color='#fcc36d', alpha=0.8)
# Line plot for the cumulative explained variance
lineplot, = plt.plot(range(0, len(cumulative_explained_variance)), cumulative_explained_variance, marker='o', linestyle='--', color='#ff6200', linewidth=2)
# Plot optimal k value line
optimal_k_line = plt.axvline(optimal_k - 1, color='red', linestyle='--', label=f'Optimal k value = {optimal_k}')
# Set labels and title
plt.xlabel('Number of Components', fontsize=14)
plt.ylabel('Explained Variance', fontsize=14)
plt.title('Cumulative Variance vs. Number of Components', fontsize=18)
# Customize ticks and legend
plt.xticks(range(0, len(cumulative_explained_variance)))
plt.legend(handles=[barplot.patches[0], lineplot, optimal_k_line], labels=['Explained Variance of Each Component', 'Cumulative Explained Variance', f'Optimal k value = {optimal_k}'], loc=(0.62, 0.1), frameon=True, framealpha=1.0, edgecolor='#ff6200')
# Display the variance values for both graphs on the plots
x_offset = -0.3
y_offset = 0.01
for i, (ev_ratio, cum_ev_ratio) in enumerate(zip(explained_variance_ratio, cumulative_explained_variance)):
plt.text(i, ev_ratio, f"{ev_ratio:.2f}", ha="center", va="bottom", fontsize=10)
if i > 0:
plt.text(i + x_offset, cum_ev_ratio + y_offset, f"{cum_ev_ratio:.2f}", ha="center", va="bottom", fontsize=10)
plt.grid(axis='both')
plt.show()
When selecting the optimal number of components, we typically look for the point where adding another component no longer significantly boosts the cumulative explained variance, often termed the "elbow point" of the curve.
From the plotted data, it's evident that the increase in cumulative variance begins to slow after the 6th component, which accounts for 87% of the total variance.
Considering the objective of customer segmentation, our aim is to retain adequate information to effectively identify distinct customer groups. Hence, retaining the first 6 components appears to strike a balance. These components collectively explain a substantial portion of the total variance while simultaneously reducing the dataset's dimensionality.
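The same cut-off can be found programmatically from the cumulative_explained_variance array computed above. A minimal sketch, assuming an 85% variance threshold:
# Smallest number of components whose cumulative explained variance reaches the threshold
threshold = 0.85
n_components = int(np.argmax(cumulative_explained_variance >= threshold)) + 1
print(f'{n_components} components explain {cumulative_explained_variance[n_components - 1]:.0%} of the variance')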
# Creating a PCA object with 6 components
pca = PCA(n_components=6)
# Fitting and transforming the original data to the new PCA dataframe
customer_engagement_pca = pca.fit_transform(customer_engagement_scaled)
# Creating a new dataframe from the PCA dataframe, with columns labeled PC1, PC2, etc.
customer_engagement_pca = pd.DataFrame(customer_engagement_pca, columns=['PC'+str(i+1) for i in range(pca.n_components_)])
# Adding the UserID index back to the new PCA dataframe
customer_engagement_pca.index = customer_engagement_scaled.index
customer_engagement_pca
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | |
|---|---|---|---|---|---|---|
| UserID | ||||||
| User1 | 0.936563 | -1.143190 | 0.910283 | -0.009028 | 2.010212 | 1.250703 |
| User10 | 4.083905 | 0.884282 | -1.529844 | 3.223612 | 1.358676 | -0.113040 |
| User11 | 0.449232 | -1.182670 | -2.514097 | -0.001027 | -1.539772 | -1.158315 |
| User12 | -4.361374 | -0.856840 | 3.896517 | 1.770110 | -0.579031 | -1.073711 |
| User14 | 1.513844 | 3.737479 | 1.196679 | 1.048500 | -0.684657 | -0.629263 |
| User15 | 4.389574 | 1.534803 | 1.544999 | -0.930572 | -0.279368 | -0.287863 |
| User16 | -1.653292 | -1.770690 | -0.667237 | -1.033714 | 0.281043 | -0.088000 |
| User17 | -3.555436 | 1.687557 | 0.776401 | -0.507149 | 1.497421 | 0.931333 |
| User18 | -2.388122 | -1.352195 | -1.020745 | 1.467038 | 0.905118 | 0.044197 |
| User19 | 2.777201 | -3.054716 | 0.514527 | -0.735451 | 0.937913 | -0.113852 |
| User2 | 0.529988 | 0.986667 | 0.545649 | -0.362776 | -1.723708 | 2.640775 |
| User20 | 0.334986 | 1.792064 | -1.835964 | -1.354093 | 0.488958 | -0.832498 |
| User3 | 1.846067 | 1.609827 | 0.816904 | -1.487781 | 0.274646 | -0.924438 |
| User4 | -2.650367 | 0.879670 | -1.954325 | 0.487366 | -0.431188 | 0.229777 |
| User5 | -1.780206 | 0.592616 | -1.448589 | 0.163298 | -0.676998 | 1.370638 |
| User6 | -1.190682 | -0.477178 | 0.430680 | -1.438194 | 0.525688 | -0.066020 |
| User7 | -3.839584 | 0.407699 | -0.567468 | -0.669786 | 0.034044 | -1.241481 |
| User8 | 3.924105 | -2.071793 | 0.466635 | -0.136173 | -0.451456 | -0.096538 |
| User9 | 0.633596 | -2.203393 | 0.438995 | 0.505821 | -1.947541 | 0.157595 |
# Define a function to highlight the top 3 absolute values in each column of a dataframe
def top_3(column):
top3 = column.abs().nlargest(3).index
return ['background-color: #ffeacc' if i in top3 else '' for i in column.index]
# Create the PCA component DataFrame and apply the highlighting function
pc_df = pd.DataFrame(pca.components_.T, columns=['PC{}'.format(i+1) for i in range(pca.n_components_)],
index=customer_engagement_scaled.columns)
pc_df.style.apply(top_3, axis=0)
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | |
|---|---|---|---|---|---|---|
| Days_Since_Last_Purchase | 0.012421 | 0.155683 | -0.251566 | 0.307145 | -0.366469 | 0.686618 |
| Total Transactions | 0.246420 | 0.390521 | -0.139453 | -0.123126 | -0.131800 | -0.167960 |
| Total Products Purchased | 0.246420 | 0.390521 | -0.139453 | -0.123126 | -0.131800 | -0.167960 |
| Total Products Searched | 0.268188 | -0.218762 | 0.100864 | -0.126280 | -0.035974 | 0.248746 |
| Total Products in Wishlist | 0.233460 | -0.287695 | -0.060023 | -0.170468 | -0.060605 | -0.208527 |
| Total Products in Cart | 0.289884 | -0.243663 | -0.151697 | 0.084687 | 0.044878 | -0.018738 |
| Total Products in Click History | 0.263380 | -0.146000 | 0.006987 | 0.357851 | 0.231570 | -0.283946 |
| Total Spend | 0.193791 | 0.489730 | 0.063556 | -0.049532 | 0.055873 | -0.056861 |
| Willing to Spend | 0.330902 | -0.182423 | 0.024622 | -0.032130 | -0.006959 | 0.166456 |
| Average Transaction Value | -0.015456 | 0.362477 | 0.365909 | 0.121861 | 0.384783 | 0.181243 |
| Unique Products Purchased | 0.258305 | 0.065222 | -0.309175 | -0.287152 | 0.312128 | -0.017150 |
| Unique Products to be Purchased | 0.075481 | -0.053960 | 0.041539 | 0.025785 | -0.051501 | 0.004600 |
| Same_Age_or_Above | -0.007249 | 0.062794 | -0.021540 | 0.244118 | -0.087809 | -0.157074 |
| Is_Male | 0.003481 | -0.021444 | -0.200295 | 0.161861 | -0.054565 | -0.095707 |
| Buying_Same_Item | 0.016975 | -0.051611 | 0.083206 | 0.272202 | -0.000446 | -0.098899 |
| From_Same_Location | -0.026716 | 0.030610 | 0.018478 | -0.018963 | 0.070090 | 0.053301 |
| Days_Between_Purchases | -0.041774 | 0.034261 | -0.342288 | 0.036256 | 0.668519 | 0.265742 |
| Monthly_Spending_Mean | 0.355358 | -0.075228 | 0.150140 | 0.072504 | 0.048136 | 0.177516 |
| Monthly_Spending_Std | 0.270756 | 0.094767 | 0.191413 | -0.288405 | -0.193403 | 0.097605 |
| Yearly_Spending_Mean | 0.355358 | -0.075228 | 0.150140 | 0.072504 | 0.048136 | 0.177516 |
| Yearly_Spending_Std | 0.223646 | 0.135225 | -0.010937 | 0.581604 | -0.049576 | -0.176974 |
| Spending_Trend | 0.012840 | 0.002409 | -0.615047 | 0.011609 | -0.109695 | -0.027400 |
Determining the Optimal Number of Clusters:
Elbow Method
Concept:
The Elbow method evaluates the sum of squared distances (inertia) for different values of K (number of clusters).
It looks for the “elbow” point in the plot where the rate of decrease in inertia slows down.
How It Works:
Run K-means clustering for a range of K values.
Calculate the inertia (sum of squared distances) for each K.
Plot the inertia against K.
The “elbow” point indicates the optimal K.
Pros:
Simple and intuitive.
Provides a visual clue for choosing K.
Cons:
Ambiguous if the plot lacks a clear elbow point.
Doesn’t work well for non-convex clusters.
# Set plot style, and background color
sns.set(style='darkgrid', rc={'axes.facecolor': '#fcf0dc'})
# Set the color palette for the plot
sns.set_palette(['#ff6200'])
# Instantiate the clustering model with the specified parameters
km = KMeans(init='k-means++', n_init=10, max_iter=100, random_state=0)
# Create a figure and axis with the desired size
fig, ax = plt.subplots(figsize=(12, 5))
# Instantiate the KElbowVisualizer with the model and range of k values, and disable the timing plot
visualizer = KElbowVisualizer(km, k=(2, 15), timings=False, ax=ax)
# Fit the data to the visualizer
visualizer.fit(customer_engagement_pca)
# Finalize and render the figure
visualizer.show();
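For reference, the inertia curve that KElbowVisualizer draws can also be reproduced with plain scikit-learn. A minimal sketch using the same KMeans parameters as above:
# Manually computing inertia (within-cluster sum of squares) for each k
inertias = []
k_values = range(2, 15)
for k in k_values:
    km_k = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=100, random_state=0)
    km_k.fit(customer_engagement_pca)
    inertias.append(km_k.inertia_)
plt.figure(figsize=(12, 5))
plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Curve (manual computation)')
plt.show()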
Silhouette Method
Concept:
Silhouette analysis assesses the separation distance between resulting clusters.
Measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
How It Works:
Calculate the silhouette score for each data point.
Calculate a_i: Compute the average distance between point i and all other points within its cluster.
Calculate b_i: Compute the average distance between point i and all points in the nearest cluster to its own.
Now compute the silhouette score (s_i) = (b_i−a_i)/max(b_i,a_i)
Silhouette score ranges from -1 to 1:
Near +1: Sample is far from neighboring clusters.
0: Sample is on the decision boundary between clusters.
Negative: Sample might be assigned to the wrong cluster.
Choose the K value that maximizes the average silhouette score.
Pros:
Considers both cohesion and separation.
Works well for non-convex clusters.
Cons:
Computationally expensive for large datasets.
Choosing Between Them:
Because it accounts for both cohesion and separation, the Silhouette method is generally preferred over the Elbow method here, especially since the elbow point can be ambiguous.
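As a quick illustration of the score itself before running the full analysis, the average silhouette for a single candidate k can be computed in a few lines. A minimal sketch, assuming k = 4:
# Average silhouette score for one candidate number of clusters
km_4 = KMeans(n_clusters=4, init='k-means++', n_init=10, max_iter=100, random_state=0)
labels_4 = km_4.fit_predict(customer_engagement_pca)
print(f'Average silhouette score for k=4: {silhouette_score(customer_engagement_pca, labels_4):.3f}')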
# Perform Silhouette analysis for a range of k values and visualize the results.
def silhouette_analysis(df, start_k, stop_k, figsize=(15, 16)):
# Set the size of the figure
plt.figure(figsize=figsize)
# Create a grid with (stop_k - start_k + 1) rows and 2 columns
grid = gridspec.GridSpec(stop_k - start_k + 1, 2)
# Assign the first plot to the first row and both columns
first_plot = plt.subplot(grid[0, :])
# First plot: Silhouette scores for different k values
sns.set_palette(['darkorange'])
silhouette_scores = []
# Iterate through the range of k values
for k in range(start_k, stop_k + 1):
km = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=100, random_state=0)
km.fit(df)
labels = km.predict(df)
score = silhouette_score(df, labels)
silhouette_scores.append(score)
best_k = start_k + silhouette_scores.index(max(silhouette_scores))
plt.plot(range(start_k, stop_k + 1), silhouette_scores, marker='o')
plt.xticks(range(start_k, stop_k + 1))
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette score')
plt.title('Average Silhouette Score for Different k Values', fontsize=15)
# Add the optimal k value text to the plot
optimal_k_text = f'The k value with the highest Silhouette score is: {best_k}'
plt.text(10, 0.23, optimal_k_text, fontsize=12, verticalalignment='bottom', horizontalalignment='left', bbox=dict(facecolor='#fcc36d', edgecolor='#ff6200', boxstyle='round, pad=0.5'))
# Second plot (subplot): Silhouette plots for each k value
colors = sns.color_palette("bright")
for i in range(start_k, stop_k + 1):
km = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=100, random_state=0)
row_idx, col_idx = divmod(i - start_k, 2)
# Assign the plots to the second, third, and fourth rows
ax = plt.subplot(grid[row_idx + 1, col_idx])
visualizer = SilhouetteVisualizer(km, colors=colors, ax=ax)
visualizer.fit(df)
# Add the Silhouette score text to the plot
score = silhouette_score(df, km.labels_)
ax.text(0.97, 0.02, f'Silhouette Score: {score:.2f}', fontsize=12, ha='right', transform=ax.transAxes, color='red')
ax.set_title(f'Silhouette Plot for {i} Clusters', fontsize=15)
plt.tight_layout()
plt.show()
silhouette_analysis(customer_engagement_pca, 3, 12, figsize=(20, 50))
To interpret silhouette plots and determine the optimal number of clusters (k), the following criteria are considered:
Silhouette Score Width:
Wide Widths (closer to +1): Indicate well-separated clusters, with data points distinct from other clusters.
Narrow Widths (closer to -1): Suggest poorly defined clusters, where data points are not distinctly separated.
Average Silhouette Score:
High Average Width: Reflects well-separated clusters.
Low Average Width: Indicates poor separation between clusters.
Cluster Thickness:
Uniform Thickness: Suggests balanced clusters with a roughly equal number of data points.
Variable Thickness: Signifies an uneven distribution of data points across clusters.
Clear Peaks: Indicate potential optimal k values.
Uniform Widths: Prefer silhouette plots with consistent widths across clusters.
Variable Widths: Avoid fluctuations, suggesting poorly defined clusters.
Maximize Overall Average Silhouette Score: Choose the k value with the highest average silhouette score across clusters.
Avoid Below-Average Scores: Ensure most clusters have above-average silhouette scores to maintain optimal clustering.
Consistent Cluster Formation: Visually examine silhouette plots for consistent and structured clusters.
Cluster Compactness: Look for compact clusters, with data points having silhouette scores closer to +1, indicating better clustering.
K-Means Clustering:
K-means is an unsupervised learning algorithm used for clustering.
The “K” in K-means represents the number of predefined clusters we want to create.
The algorithm partitions data points into K clusters based on similarity.
Each cluster is associated with a centroid (a representative point).
The main idea is to minimize the distance between data points and their respective cluster centroids.
How K-Means Works:-
Initialization:
Choose the number of clusters (K).
Randomly initialize K centroids.
Assignment:
Assign each data point to the nearest centroid (cluster).
Calculate distances (usually Euclidean distance) between data points and centroids.
Update Centroids:
Recalculate the mean (centroid) for each cluster.
Move the centroids to the new mean positions.
Repeat Steps 2 and 3:
Iterate until the centroids no longer change significantly.
Final Clusters:
The resulting clusters represent distinct customer segments.
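For illustration, these steps can be written out in a few lines of NumPy. This is a minimal sketch only; the analysis below uses scikit-learn's KMeans, which adds smarter k-means++ initialization and multiple restarts:
# A bare-bones K-means loop following the steps above (illustrative only)
def kmeans_sketch(X, k, n_iter=100, seed=0):
    # X: (n_samples, n_features) NumPy array
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random observations as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assignment: each point joins the cluster of its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points (keep it if the cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids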
# Apply KMeans clustering using the optimal k
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, max_iter=100, random_state=0)
kmeans.fit(customer_engagement_pca)
# Get the frequency of each cluster
cluster_frequencies = Counter(kmeans.labels_)
# Create a mapping from old labels to new labels based on cluster frequency
label_mapping = {label: new_label for new_label, (label, _) in enumerate(cluster_frequencies.most_common())}
# Override the frequency-based mapping with a manual relabelling
# (old label 2 -> 3, 1 -> 2, 0 -> 1, 3 -> 0) so the cluster numbers match the desired ordering
label_mapping = {v: k for k, v in {3: 2, 2: 1, 1: 0, 0: 3}.items()}
# Apply the mapping to get the new labels
new_labels = np.array([label_mapping[label] for label in kmeans.labels_])
# Append the new cluster labels back to the original dataset
customer_engagement_cleaned['cluster'] = new_labels
# Append the new cluster labels to the PCA version of the dataset
customer_engagement_pca['cluster'] = new_labels
customer_engagement_cleaned
| UserID | Days_Since_Last_Purchase | Total Transactions | Total Products Purchased | Total Products Searched | Total Products in Wishlist | Total Products in Cart | Total Products in Click History | Total Spend | Willing to Spend | ... | Is_Male | Buying_Same_Item | From_Same_Location | Days_Between_Purchases | Monthly_Spending_Mean | Monthly_Spending_Std | Yearly_Spending_Mean | Yearly_Spending_Std | Spending_Trend | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | User1 | 8 | 20 | 20 | 25 | 27 | 25 | 27 | 5776.59 | 11456.50 | ... | 0 | 0 | 0 | 4.448980 | 1204.861667 | 585.342892 | 7229.170 | 937.156901 | -0.684772 | 1 |
| 1 | User10 | 18 | 29 | 29 | 27 | 26 | 32 | 34 | 7441.41 | 10486.52 | ... | 1 | 1 | 0 | 8.655172 | 1260.959167 | 585.678930 | 7565.755 | 4165.969099 | 0.617309 | 0 |
| 2 | User11 | 8 | 26 | 26 | 25 | 27 | 30 | 25 | 5301.64 | 10131.83 | ... | 1 | 0 | 0 | -0.980000 | 1006.054167 | 570.867239 | 6036.325 | 1416.426807 | 3.460524 | 1 |
| 3 | User12 | 2 | 17 | 17 | 19 | 22 | 17 | 26 | 4673.99 | 7863.82 | ... | 0 | 1 | 0 | -5.461538 | 901.659167 | 482.868280 | 5409.955 | 1166.789829 | -4.053605 | 2 |
| 4 | User14 | 13 | 31 | 31 | 21 | 22 | 24 | 25 | 9396.11 | 9997.53 | ... | 0 | 1 | 0 | -4.673913 | 1107.518333 | 623.848447 | 6645.110 | 2353.010951 | -0.295626 | 3 |
| 5 | User15 | 2 | 30 | 30 | 28 | 25 | 29 | 27 | 8396.60 | 11807.39 | ... | 0 | 0 | 0 | -3.981481 | 1284.206667 | 933.262190 | 7705.240 | 2028.547934 | -0.315243 | 3 |
| 6 | User16 | 5 | 21 | 21 | 23 | 28 | 25 | 22 | 4820.72 | 9924.32 | ... | 0 | 0 | 0 | 3.304348 | 953.534167 | 492.675970 | 5721.205 | 264.224591 | 0.208918 | 2 |
| 7 | User17 | 3 | 23 | 23 | 26 | 15 | 23 | 17 | 6739.00 | 8066.76 | ... | 0 | 0 | 1 | 5.725000 | 893.305000 | 423.069753 | 5359.830 | 386.193440 | -0.701129 | 2 |
| 8 | User18 | 9 | 19 | 19 | 24 | 23 | 27 | 27 | 5151.60 | 9230.63 | ... | 1 | 1 | 0 | 3.311111 | 920.897500 | 347.307700 | 5525.385 | 773.398042 | 2.562159 | 2 |
| 9 | User19 | 0 | 22 | 22 | 35 | 32 | 31 | 30 | 5186.49 | 11015.50 | ... | 0 | 1 | 0 | 2.350877 | 1216.821667 | 714.442483 | 7300.930 | 938.005430 | -0.528994 | 1 |
| 10 | User2 | 31 | 26 | 26 | 31 | 26 | 22 | 21 | 6909.48 | 9792.30 | ... | 0 | 0 | 0 | 0.120000 | 1126.255000 | 749.061419 | 6757.530 | 759.347830 | -0.094429 | 3 |
| 11 | User20 | 9 | 30 | 30 | 22 | 29 | 23 | 25 | 7695.27 | 9229.10 | ... | 1 | 0 | 0 | 5.282609 | 952.026667 | 623.558073 | 5712.160 | 599.216429 | 1.430459 | 3 |
| 12 | User3 | 1 | 30 | 30 | 30 | 27 | 24 | 26 | 8103.76 | 9434.37 | ... | 0 | 0 | 0 | -1.020000 | 1095.616667 | 681.874269 | 6573.700 | 1107.046377 | -0.936123 | 3 |
| 13 | User4 | 19 | 24 | 24 | 20 | 21 | 20 | 23 | 5797.75 | 8694.34 | ... | 1 | 0 | 0 | 2.809524 | 890.831667 | 440.318948 | 5344.990 | 785.270365 | 1.864607 | 2 |
| 14 | User5 | 26 | 23 | 23 | 20 | 19 | 24 | 22 | 5589.11 | 9260.48 | ... | 0 | 0 | 0 | 0.186047 | 942.547500 | 510.093180 | 5655.285 | 912.033397 | 0.975988 | 2 |
| 15 | User6 | 1 | 23 | 23 | 22 | 23 | 26 | 22 | 5605.75 | 9366.50 | ... | 0 | 0 | 0 | 3.163265 | 1005.869167 | 670.056435 | 6035.215 | 80.560676 | -1.027859 | 2 |
| 16 | User7 | 2 | 22 | 22 | 18 | 23 | 19 | 21 | 5526.25 | 7725.81 | ... | 0 | 0 | 0 | -0.813953 | 812.149167 | 412.437383 | 4872.895 | 704.214714 | 0.975034 | 2 |
| 17 | User8 | 7 | 26 | 26 | 36 | 28 | 33 | 32 | 6121.69 | 11976.82 | ... | 0 | 0 | 0 | -5.725806 | 1299.064167 | 533.605655 | 7794.385 | 1187.790900 | -0.172598 | 1 |
| 18 | User9 | 11 | 22 | 22 | 30 | 28 | 30 | 22 | 4945.00 | 10629.51 | ... | 1 | 1 | 0 | -4.627451 | 1103.862500 | 632.628950 | 6623.175 | 1827.185136 | -0.027006 | 1 |
19 rows × 24 columns
customer_engagement_pca
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | cluster | |
|---|---|---|---|---|---|---|---|
| UserID | |||||||
| User1 | 0.936563 | -1.143190 | 0.910283 | -0.009028 | 2.010212 | 1.250703 | 1 |
| User10 | 4.083905 | 0.884282 | -1.529844 | 3.223612 | 1.358676 | -0.113040 | 0 |
| User11 | 0.449232 | -1.182670 | -2.514097 | -0.001027 | -1.539772 | -1.158315 | 1 |
| User12 | -4.361374 | -0.856840 | 3.896517 | 1.770110 | -0.579031 | -1.073711 | 2 |
| User14 | 1.513844 | 3.737479 | 1.196679 | 1.048500 | -0.684657 | -0.629263 | 3 |
| User15 | 4.389574 | 1.534803 | 1.544999 | -0.930572 | -0.279368 | -0.287863 | 3 |
| User16 | -1.653292 | -1.770690 | -0.667237 | -1.033714 | 0.281043 | -0.088000 | 2 |
| User17 | -3.555436 | 1.687557 | 0.776401 | -0.507149 | 1.497421 | 0.931333 | 2 |
| User18 | -2.388122 | -1.352195 | -1.020745 | 1.467038 | 0.905118 | 0.044197 | 2 |
| User19 | 2.777201 | -3.054716 | 0.514527 | -0.735451 | 0.937913 | -0.113852 | 1 |
| User2 | 0.529988 | 0.986667 | 0.545649 | -0.362776 | -1.723708 | 2.640775 | 3 |
| User20 | 0.334986 | 1.792064 | -1.835964 | -1.354093 | 0.488958 | -0.832498 | 3 |
| User3 | 1.846067 | 1.609827 | 0.816904 | -1.487781 | 0.274646 | -0.924438 | 3 |
| User4 | -2.650367 | 0.879670 | -1.954325 | 0.487366 | -0.431188 | 0.229777 | 2 |
| User5 | -1.780206 | 0.592616 | -1.448589 | 0.163298 | -0.676998 | 1.370638 | 2 |
| User6 | -1.190682 | -0.477178 | 0.430680 | -1.438194 | 0.525688 | -0.066020 | 2 |
| User7 | -3.839584 | 0.407699 | -0.567468 | -0.669786 | 0.034044 | -1.241481 | 2 |
| User8 | 3.924105 | -2.071793 | 0.466635 | -0.136173 | -0.451456 | -0.096538 | 1 |
| User9 | 0.633596 | -2.203393 | 0.438995 | 0.505821 | -1.947541 | 0.157595 | 1 |
3D Visualization of Top Principal Components
# Setting up the color scheme for the clusters (RGB order)
colors = ['#e8000b', '#1ac938', '#023eff', '#ff7c00']
# Create separate data frames for each cluster
cluster_0 = customer_engagement_pca[customer_engagement_pca['cluster'] == 0]
cluster_1 = customer_engagement_pca[customer_engagement_pca['cluster'] == 1]
cluster_2 = customer_engagement_pca[customer_engagement_pca['cluster'] == 2]
cluster_3 = customer_engagement_pca[customer_engagement_pca['cluster'] == 3]
# Create a 3D scatter plot
fig = go.Figure()
# Add data points for each cluster separately and specify the color
fig.add_trace(go.Scatter3d(x=cluster_0['PC1'], y=cluster_0['PC2'], z=cluster_0['PC3'], mode='markers', marker=dict(color=colors[0], size=5, opacity=0.4), name='Cluster 0'))
fig.add_trace(go.Scatter3d(x=cluster_1['PC1'], y=cluster_1['PC2'], z=cluster_1['PC3'], mode='markers', marker=dict(color=colors[1], size=5, opacity=0.4), name='Cluster 1'))
fig.add_trace(go.Scatter3d(x=cluster_2['PC1'], y=cluster_2['PC2'], z=cluster_2['PC3'], mode='markers', marker=dict(color=colors[2], size=5, opacity=0.4), name='Cluster 2'))
fig.add_trace(go.Scatter3d(x=cluster_3['PC1'], y=cluster_3['PC2'], z=cluster_3['PC3'], mode='markers', marker=dict(color=colors[3], size=5, opacity=0.4), name='Cluster 3'))
# Set the title and layout details
fig.update_layout(
title=dict(text='3D Visualization of Customer Clusters in PCA Space with 4 Clusters', x=0.5),
scene=dict(
xaxis=dict(backgroundcolor="#fcf0dc", gridcolor='white', title='PC1'),
yaxis=dict(backgroundcolor="#fcf0dc", gridcolor='white', title='PC2'),
zaxis=dict(backgroundcolor="#fcf0dc", gridcolor='white', title='PC3'),
),
width=900,
height=800
)
# Show the plot
fig.show()
Cluster Distribution Visualization
# Calculate the percentage of customers in each cluster
cluster_percentage = (customer_engagement_pca['cluster'].value_counts(normalize=True) * 100).reset_index()
cluster_percentage.columns = ['Cluster', 'Percentage']
cluster_percentage.sort_values(by='Cluster', inplace=True)
# Create a horizontal bar plot
plt.figure(figsize=(10, 4))
sns.barplot(x='Percentage', y='Cluster', data=cluster_percentage, orient='h', palette=colors)
# Adding percentages on the bars
for index, value in enumerate(cluster_percentage['Percentage']):
plt.text(value+0.5, index, f'{value:.2f}%')
plt.title('Distribution of Customers Across Clusters', fontsize=14)
plt.xticks(ticks=np.arange(0, 50, 5))
plt.xlabel('Percentage (%)')
# Show the plot
plt.show()
Evaluation Metrics
Silhouette Score:
The Silhouette Score assesses the quality of clustering by comparing the similarity of data points within the same cluster to data points in different clusters.
Mathematical Formula:
For each data point i:
Calculate the average distance from i to other data points in the same cluster (a_i)
Calculate the average distance from i to the data points in the nearest other cluster (b_i)
Silhouette Score for data point i: (s_i) = (b_i−a_i)/max(b_i,a_i)
The overall Silhouette Score is the mean of all individual data point scores.
Interpretation:
Ranges from -1 (poor clustering) to +1 (perfect clustering).
A score close to 1 suggests well-separated clusters.
Davies-Bouldin Index:
The Davies-Bouldin Index evaluates the average similarity between each cluster and its most comparable cluster.
Mathematical Formula:
For each cluster i:
Calculate the dissimilarity between cluster i and the cluster most similar to i.
Davies-Bouldin Index (DB) is the average of these worst-case similarities: DB = (1/n) ∑_i max_{j≠i} ((a_i + a_j)/d(i,j))
n is the number of clusters.
a_i is the average distance of the points in cluster i to its centroid (a measure of cluster scatter).
d(i,j) is the distance between the centroids of clusters i and j.
Interpretation:
Lower numbers suggest better clustering solutions.
Calinski-Harabasz Index (Variance Ratio Criterion):
The Calinski-Harabasz Index measures the ratio of between-cluster variance to within-cluster variance.
Mathematical Formula:
Calculate the sum of squares between clusters (B) and within clusters (W):
B = ∑ n_i‖C_i − C‖², summed over clusters, where n_i is the size of cluster i, C_i is its centroid, and C is the overall centroid.
W = ∑∑ ‖x − C_i‖², summed over the points x belonging to each cluster.
Calinski-Harabasz Index (CH): CH = (B/W) × (N−K)/(K−1)
K is the number of clusters.
N is the total number of data points.
Interpretation:
Higher numbers suggest better-defined clusters.
# Compute number of customers
num_observations = len(customer_engagement_pca)
# Separate the features and the cluster labels
X = customer_engagement_pca.drop('cluster', axis=1)
clusters = customer_engagement_pca['cluster']
# Compute the metrics
sil_score = silhouette_score(X, clusters)
calinski_score = calinski_harabasz_score(X, clusters)
davies_score = davies_bouldin_score(X, clusters)
# Create a table to display the metrics and the number of observations
table_data = [
["Number of Observations", num_observations],
["Silhouette Score", sil_score],
["Calinski Harabasz Score", calinski_score],
["Davies Bouldin Score", davies_score]
]
# Print the table
print(tabulate(table_data, headers=["Metric", "Value"], tablefmt='pretty'))
+-------------------------+---------------------+ | Metric | Value | +-------------------------+---------------------+ | Number of Observations | 19 | | Silhouette Score | 0.22845646386066745 | | Calinski Harabasz Score | 5.980821108533767 | | Davies Bouldin Score | 1.0689987453375518 | +-------------------------+---------------------+
Silhouette Score: Fair
Calinski-Harabasz Score: Reasonable
Davies-Bouldin Score: Reasonable
Radar Chart Approach
To create radar charts visualizing the centroid values of each cluster across the different features, we must first calculate the centroid of each cluster; each centroid is simply the mean of every feature within that cluster. We then display these centroids on radar charts, giving a straightforward view of the central tendencies of each feature across the clusters.
# Setting 'UserID' column as index and assigning it to a new dataframe
df_customer = customer_engagement_cleaned.set_index('UserID')
# Standardize the data (excluding the cluster column)
scaler = StandardScaler()
df_customer_standardized = scaler.fit_transform(df_customer.drop(columns=['cluster'], axis=1))
# Create a new dataframe with standardized values and add the cluster column back
df_customer_standardized = pd.DataFrame(df_customer_standardized, columns=df_customer.columns[:-1], index=df_customer.index)
df_customer_standardized['cluster'] = df_customer['cluster']
# Calculate the centroids of each cluster
cluster_centroids = df_customer_standardized.groupby('cluster').mean()
# Function to create a radar chart
def radar_chart(ax, angles, data, color, cluster):
# Plot the data and fill the area
ax.fill(angles, data, color=color, alpha=0.4)
ax.plot(angles, data, color=color, linewidth=2, linestyle='solid')
# Add a title
ax.set_title(f'Cluster {cluster}', size=20, color=color, y=1.1)
# Set data
labels=np.array(cluster_centroids.columns)
num_vars = len(labels)
# Compute angle of each axis
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
# The plot is circular, so we need to "complete the loop" and append the start to the end
labels = np.concatenate((labels, [labels[0]]))
angles += angles[:1]
# Initialize the figure
fig, ax = plt.subplots(figsize=(35, 25), subplot_kw=dict(polar=True), nrows=1, ncols=4)
# Create radar chart for each cluster
for i, color in enumerate(colors):
data = cluster_centroids.loc[i].tolist()
data += data[:1] # Complete the loop
radar_chart(ax[i], angles, data, color, i)
# Add input data
ax[0].set_xticks(angles[:-1])
ax[0].set_xticklabels(labels[:-1])
ax[1].set_xticks(angles[:-1])
ax[1].set_xticklabels(labels[:-1])
ax[2].set_xticks(angles[:-1])
ax[2].set_xticklabels(labels[:-1])
ax[3].set_xticks(angles[:-1])
ax[3].set_xticklabels(labels[:-1])
# Add a grid
ax[0].grid(color='grey', linewidth=0.5)
# Display the plot
plt.tight_layout()
plt.show()
Histogram Chart Approach
We can validate the profiles identified from the radar charts by plotting histograms for each feature segmented by the cluster labels. These histograms provide a visual inspection of the distribution of feature values within each cluster, aiding in confirming or refining the profiles derived from the radar charts.
# Plot histograms for each feature segmented by the clusters
features = customer_engagement_cleaned.columns[1:-1]
clusters = customer_engagement_cleaned['cluster'].unique()
clusters.sort()
# Setting up the subplots
n_rows = len(features)
n_cols = len(clusters)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 3*n_rows))
# Plotting histograms
for i, feature in enumerate(features):
for j, cluster in enumerate(clusters):
data = customer_engagement_cleaned[customer_engagement_cleaned['cluster'] == cluster][feature]
axes[i, j].hist(data, bins=20, color=colors[j], edgecolor='w', alpha=0.7)
axes[i, j].set_title(f'Cluster {cluster} - {feature}', fontsize=15)
axes[i, j].set_xlabel('')
axes[i, j].set_ylabel('')
# Adjusting layout to prevent overlapping
plt.tight_layout()
plt.show()
# From these histograms, cluster 2 appears to be the best-defined cluster
Brand Recommendation Machine: if a customer searches for an item name, this machine will recommend the top brand(s) for them.
def recommend_brand(item_name):
# Filter dataset for the given item name
filtered_data = df[df['Item Name'] == item_name]
if filtered_data.empty:
return "No recommendation available for this item."
# Count the occurrences of each brand
brand_counts = filtered_data['Brand'].value_counts()
    # Get all brands tied for the highest sales count (so "brand(s)" in the message is accurate)
    top_brands = brand_counts[brand_counts == brand_counts.max()].index.tolist()
    return f"Recommended brand(s) for {item_name}: {', '.join(top_brands)}"
# Main
item_to_search = input("Enter the item name: ")
recommendation = recommend_brand(item_to_search)
print(recommendation)
Enter the item name: Laptop Recommended brand(s) for Laptop: Thompson Group
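A possible extension (an assumption, not part of the original machine) is to recommend the top-N brands per item, ranking by sales count and breaking ties with average rating. A minimal sketch with a hypothetical recommend_top_brands helper:
# Hypothetical top-N variant: rank brands by sales count, then by average rating
def recommend_top_brands(item_name, n=3):
    filtered = df[df['Item Name'] == item_name]
    if filtered.empty:
        return "No recommendation available for this item."
    ranked = (filtered.groupby('Brand')['Rating']
                      .agg(sales='size', avg_rating='mean')
                      .sort_values(['sales', 'avg_rating'], ascending=False)
                      .head(n))
    return f"Top brand(s) for {item_name}: {', '.join(ranked.index)}"
print(recommend_top_brands('Laptop'))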